python - Pandas taking Cumulative Sum with Reset -


problem

i'm trying keep running total of consecutive timestamps (minute frequency). have way of taking cumulative sum , resetting on condition 2 columns not match, done loop. wondering if there way without loop.

code

cb_arbitrage['shift'] = cb_arbitrage.index.shift(1, freq='t') 

returns:

                        cccccccc     bbbbbbbb  cb_spread         shift timestamp                                                                    2017-07-07 18:23:00  2535.002000  2524.678462  10.323538 2017-07-07 18:24:00 2017-07-07 18:24:00  2535.007826  2523.297619  11.710207 2017-07-07 18:25:00 2017-07-07 18:25:00  2535.004167  2524.391000  10.613167 2017-07-07 18:26:00 2017-07-07 18:26:00  2534.300000  2521.838667  12.461333 2017-07-07 18:27:00 2017-07-07 18:27:00  2530.231429  2520.195625  10.035804 2017-07-07 18:28:00 2017-07-07 18:28:00  2529.444667  2518.782143  10.662524 2017-07-07 18:29:00 2017-07-07 18:29:00  2528.988000  2518.802963  10.185037 2017-07-07 18:30:00 2017-07-07 18:59:00  2514.403367  2526.473333  12.069966 2017-07-07 19:00:00 2017-07-07 19:01:00  2516.410000  2528.980000  12.570000 2017-07-07 19:02:00 

then following:

cb_arbitrage['shift'] = cb_arbitrage['shift'].shift(1) cb_arbitrage['shift'][0] = cb_arbitrage.index[0] cb_arbitrage['count'] = 0 

which returns:

                        cccccccc     bbbbbbbb  cb_spread               shift  count timestamp                                                                           2017-07-07 18:23:00  2535.002000  2524.678462  10.323538 2017-07-07 18:23:00      0 2017-07-07 18:24:00  2535.007826  2523.297619  11.710207 2017-07-07 18:24:00      0 2017-07-07 18:25:00  2535.004167  2524.391000  10.613167 2017-07-07 18:25:00      0 2017-07-07 18:26:00  2534.300000  2521.838667  12.461333 2017-07-07 18:26:00      0 2017-07-07 18:27:00  2530.231429  2520.195625  10.035804 2017-07-07 18:27:00      0 2017-07-07 18:28:00  2529.444667  2518.782143  10.662524 2017-07-07 18:28:00      0 2017-07-07 18:29:00  2528.988000  2518.802963  10.185037 2017-07-07 18:29:00      0 2017-07-07 18:59:00  2514.403367  2526.473333  12.069966 2017-07-07 18:30:00      0 2017-07-07 19:01:00  2516.410000  2528.980000  12.570000 2017-07-07 19:00:00      0 

then, loop calculate cumulative sum, reset:

count = 0 i, row in cb_arbitrage.iterrows():      if == cb_arbitrage.loc[i]['shift']:         count += 1         cb_arbitrage.set_value(i, 'count', count)     else:         count = 1         cb_arbitrage.set_value(i, 'count', count) 

which gives me expected result:

                        cccccccc     bbbbbbbb  cb_spread               shift  count timestamp                                                                           2017-07-07 18:23:00  2535.002000  2524.678462  10.323538 2017-07-07 18:23:00      1 2017-07-07 18:24:00  2535.007826  2523.297619  11.710207 2017-07-07 18:24:00      2 2017-07-07 18:25:00  2535.004167  2524.391000  10.613167 2017-07-07 18:25:00      3 2017-07-07 18:26:00  2534.300000  2521.838667  12.461333 2017-07-07 18:26:00      4 2017-07-07 18:27:00  2530.231429  2520.195625  10.035804 2017-07-07 18:27:00      5 2017-07-07 18:28:00  2529.444667  2518.782143  10.662524 2017-07-07 18:28:00      6 2017-07-07 18:29:00  2528.988000  2518.802963  10.185037 2017-07-07 18:29:00      7 2017-07-07 18:59:00  2514.403367  2526.473333  12.069966 2017-07-07 18:30:00      1 2017-07-07 19:01:00  2516.410000  2528.980000  12.570000 2017-07-07 19:00:00      1 2017-07-07 21:55:00  2499.904560  2510.814000  10.909440 2017-07-07 19:02:00      1 2017-07-07 21:56:00  2500.134615  2510.812857  10.678242 2017-07-07 21:56:00      2 

you can use diff method finds difference between current row , previous row. can check , see if difference equal 1 minute. here, there lots of trickery reset streaks within data.

we first take cumulative sum of boolean series, gets close want. reset series multiply cumulative sum series original boolean, since false evaluates 0.

s = cb_arbitrage.timestamp.diff() == pd.timedelta('1 minute') s1 = s.cumsum() s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1, fill_value=0) + 1  0     1.0 1     2.0 2     3.0 3     4.0 4     5.0 5     6.0 6     7.0 7     1.0 8     1.0 9     1.0 10    2.0 

Comments