Using a rolling sum to identify events from start/stop data

Time: 2017-01-20 23:11:42

Tags: python pandas

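I am trying to extract specific events from my dataset `water` using the code below. (The data shown here is not my actual dataset.)

Currently, my code classifies an event based on whether there are zeros between values greater than zero. It then sums those values and returns the total amount of water used per event. However, the code currently splits one event into two even when only a few seconds of zeros separate them. If the zeros between two events last less than 5 seconds, I would like to classify them as the same event.

How can I adjust my code so that it checks whether the run of zeros between events is shorter than 5 seconds and, if so, classifies them as the same event?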

import pandas as pd

rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='S')
water = [0,0,0.2,0.3,0.4,0,0,0.3,0.2,0.5]*6+[0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})

starts = (df['water_amount']>0)&(df['water_amount'].shift(1)==0) #find all starts of events
n_events = sum(starts) #total number of events
df.loc[starts,'event_number'] = range(1,n_events+1) #numerate starts from 1 to n
df['event_number'] = df['event_number'].ffill().fillna(-1) #forward fill all the values
df.loc[df['water_amount']==0,'event_number']=-1 #set all event numbers to -1 where the water amount is 0

df.groupby('event_number').agg({'time_stamp':'first',
                                    'water_amount':'sum'}) #feature matrix

Edit: a picture illustrating my problem:

[image]

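2 answers:

Answer 0: (score: 1)

I don't know pandas, but I do have some prototype code.

Build a list of (index, data) tuples in which each sublist is a consecutive run of measurements, breaking on any 0 data.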

import itertools as it

water = [0, 0.1, 0.2, 0, 0, 0.5, 0, 0, 0, 0.9, 1.0]*2+[0]

i_water=((i,e) for i,e in enumerate(water))

chunk_i_water = [[i] + 
                 [e for e in it.takewhile(lambda x: x[1] != 0, i_water)] 
                 for i in i_water if i[1] != 0]

print('chunked: ', *chunk_i_water, sep='\n')
print('\n')

chunked: 
[(1, 0.1), (2, 0.2)]
[(5, 0.5)]
[(9, 0.9), (10, 1.0)]
[(12, 0.1), (13, 0.2)]
[(16, 0.5)]
[(20, 0.9), (21, 1.0)]

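Then merge the sublists based on the difference between their indices.

In your application you would use the index to look up the event time and test the time difference instead.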

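def MergeOnDiff(a, diff):
    # Copy each sublist so that merging does not modify the input `a`;
    # otherwise a second call with a different diff sees already-merged runs.
    b = [list(a[0])]
    for i in range(len(a)-1):
        # merge when the index gap between neighbouring runs is at most diff
        if a[i+1][0][0] - a[i][-1][0] < diff+1:
            b[-1] += a[i+1]
        else:
            b.append(list(a[i+1]))
    return b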

diff = 3
b = MergeOnDiff(chunk_i_water, diff)

print('merged with diff = ', diff, *b, sep='\n')    

merged with diff = 
3
[(1, 0.1), (2, 0.2), (5, 0.5)]
[(9, 0.9), (10, 1.0), (12, 0.1), (13, 0.2), (16, 0.5)]
[(20, 0.9), (21, 1.0)]

# change diff:

merged with diff = 
2
[(1, 0.1), (2, 0.2)]
[(5, 0.5)]
[(9, 0.9), (10, 1.0), (12, 0.1), (13, 0.2)]
[(16, 0.5)]
[(20, 0.9), (21, 1.0)]
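
As noted earlier, a real application would compare timestamps rather than index differences. The following is only a rough sketch of that idea, not part of the original answer: the function name `merge_on_time_gap`, the one-reading-per-second timestamps, and the 3-second threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def merge_on_time_gap(chunks, max_gap):
    """Merge neighbouring chunks whose time gap is at most max_gap.

    chunks: list of lists of (timestamp, value) tuples in time order.
    (Hypothetical helper, named here for illustration only.)
    """
    if not chunks:
        return []
    merged = [list(chunks[0])]  # copy so the input is left untouched
    for chunk in chunks[1:]:
        # gap between the last reading of the previous event and the
        # first reading of the next chunk
        gap = chunk[0][0] - merged[-1][-1][0]
        if gap <= max_gap:
            merged[-1].extend(chunk)
        else:
            merged.append(list(chunk))
    return merged


# Assumption: one reading per second, starting at an arbitrary time.
t0 = datetime(2017, 1, 1, 14, 0, 0)
water = [0, 0.1, 0.2, 0, 0, 0.5, 0, 0, 0, 0.9, 1.0] * 2 + [0]
stamped = [(t0 + timedelta(seconds=i), w) for i, w in enumerate(water)]

# Split the readings into runs of consecutive nonzero values.
chunks, current = [], []
for t, w in stamped:
    if w != 0:
        current.append((t, w))
    elif current:
        chunks.append(current)
        current = []
if current:
    chunks.append(current)

merged = merge_on_time_gap(chunks, timedelta(seconds=3))
print(len(chunks), 'runs merged into', len(merged), 'events')
```

With one-second sampling, a 3-second threshold here plays the same role as `diff = 3` in the index-based version, so the grouping should come out the same.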

Getting the event averages and the index ranges from the sublists is easy.

for e in b:
    ave = sum((d[1] for d in e)) / len(e)
    print('for irange' , e[0][0], 'to', e[-1][0], 'ave = ', ave )

for irange 1 to 2 ave =  0.15000000000000002
for irange 5 to 5 ave =  0.5
for irange 9 to 13 ave =  0.55
for irange 16 to 16 ave =  0.5
for irange 20 to 21 ave =  0.95

Again, you would use the indices to look up the event start/stop times.
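
A minimal sketch of that lookup, assuming sample `i` was recorded `i` seconds after a known start time. The `events` list is copied from the `diff = 2` output above; `t0`, `timestamps`, and `summary` are names introduced here for illustration.

```python
from datetime import datetime, timedelta

# Merged events as (index, value) tuples, copied from the diff = 2 output.
events = [
    [(1, 0.1), (2, 0.2)],
    [(5, 0.5)],
    [(9, 0.9), (10, 1.0), (12, 0.1), (13, 0.2)],
    [(16, 0.5)],
    [(20, 0.9), (21, 1.0)],
]

# Assumption: sample i was taken i seconds after t0.
t0 = datetime(2017, 1, 1, 14, 0, 0)
timestamps = [t0 + timedelta(seconds=i) for i in range(23)]

summary = []
for event in events:
    first_idx, last_idx = event[0][0], event[-1][0]
    total = sum(value for _, value in event)
    # translate the index range back into start/stop times
    summary.append((timestamps[first_idx], timestamps[last_idx], total))

for start, stop, total in summary:
    print(start.time(), 'to', stop.time(), 'total =', round(total, 2))
```

In the pandas setting of the question, the same lookup could presumably be done by indexing the `time_stamp` column with the first and last index of each event.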

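Answer 1: (score: 1)

Here is a solution that first labels each group of consecutive zeros. It then counts how many zeros are in each group and determines whether there are fewer than 5. Finally, it labels the entire water series, incrementing the group number only when 5 consecutive zeros are found.

Once the groups are labelled correctly, the aggregation is easy.

Create some fake data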

import pandas as pd

rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='S')
water = [0,.2,.3,0,0,.4,0,0.3,0.2,0.5,0,0,0,0,0,1,3,4,0,0,0,0,5,4,0,0,2,4,0,0,
         0,0,0,1,0,0,0,0,0,1,.4,.3,.1,.4,0,0,0,4,5,0,1,0,0,0,0,0,5,1,2,0,0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})
water = df.water_amount

Group and aggregate

groups = water.ne(0).diff().fillna(0 == water.iloc[0]).cumsum().mul(water.eq(0))

counts = groups.value_counts()
counts.loc[0] = 0

groups5 = groups.map(counts).lt(5)

groups_final = groups5.diff().cumsum().fillna(0).add(1).mul(groups5).astype(int)

df_agg = df.groupby(groups_final).agg({'time_stamp':['first', 'last'],
                                    'water_amount':'sum'}).drop(0) 
df_agg.index.set_names(['Group Number'], inplace=True)

print(df_agg)

Output
                      time_stamp                     water_amount
                           first                last          sum
Group Number                                                     
1            2017-01-01 14:00:00 2017-01-01 14:00:09          1.9
3            2017-01-01 14:00:15 2017-01-01 14:00:27         23.0
5            2017-01-01 14:00:33 2017-01-01 14:00:33          1.0
7            2017-01-01 14:00:39 2017-01-01 14:00:50         12.2
9            2017-01-01 14:00:56 2017-01-01 14:01:00          8.0
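
The labelling steps described in this answer can also be written out one at a time, which may be easier to follow than the chained expressions. The sketch below is an alternative formulation rather than the answer's own code: the intermediate names (`is_zero`, `run_id`, `run_len`, `long_gap`, `event_id`) are made up for illustration, and it numbers events consecutively instead of 1, 3, 5, and so on.

```python
import pandas as pd

# Same fake data as in the answer above, without the timestamps.
water = pd.Series(
    [0, .2, .3, 0, 0, .4, 0, 0.3, 0.2, 0.5, 0, 0, 0, 0, 0, 1, 3, 4, 0, 0,
     0, 0, 5, 4, 0, 0, 2, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, .4, .3,
     .1, .4, 0, 0, 0, 4, 5, 0, 1, 0, 0, 0, 0, 0, 5, 1, 2, 0, 0])

is_zero = water.eq(0)

# Step 1: give every run of consecutive equal is_zero values its own id.
run_id = is_zero.ne(is_zero.shift()).cumsum()

# Step 2: length of the run that each sample belongs to.
run_len = run_id.map(run_id.value_counts())

# Step 3: a "long gap" is a run of zeros with at least 5 samples.
long_gap = is_zero & run_len.ge(5)

# Step 4: an event starts at any non-gap sample that follows a long gap
# (or sits at the very beginning of the series).
event_start = ~long_gap & long_gap.shift(fill_value=True)

# Step 5: number the events; samples inside long gaps get label 0.
event_id = event_start.cumsum().where(~long_gap, 0)

# Sum the water per event and drop the gap label.
totals = water.groupby(event_id).sum().drop(0, errors='ignore')
print(totals.round(2))
```

On this data the sketch should find the same five events with the same sums as the table above. As in the answer, short runs of zeros (fewer than 5) stay inside the surrounding event and do not affect the totals.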