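I am trying to extract specific events from my dataset water using the code below. (The data shown here is not my actual dataset.)
Currently, my code classifies an event based on whether there are zeros between values greater than zero. It then sums those values and returns the total amount of water used per event. However, the code currently splits one event into two even when there are only a few seconds of zeros in between. If the zeros between two events last less than 5 seconds, I want them classified as the same event.
How can I adjust the code to check whether the run of zeros between events is shorter than 5 seconds and, if so, classify them as the same event?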
import pandas as pd

rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='s')
water = [0,0,0.2,0.3,0.4,0,0,0.3,0.2,0.5]*6+[0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})
starts = (df['water_amount']>0)&(df['water_amount'].shift(1)==0) #find all starts of events
n_events = sum(starts) #total number of events
df.loc[starts,'event_number'] = range(1,n_events+1) #number the starts from 1 to n
df['event_number'] = df['event_number'].ffill().fillna(-1) #forward fill all the values
df.loc[df['water_amount']==0,'event_number']=-1 #set all event numbers to -1 where the water amount is 0
df.groupby('event_number').agg({'time_stamp':'first',
                                'water_amount':'sum'}) #feature matrix
Edit: a picture illustrating my problem: [image not shown]
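Answer 0 (score: 1)
I don't know pandas, but I do have some prototype code.
Build a list of (index, data) tuples, where each sublist is a run of consecutive measurements, breaking at any 0 data: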
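import itertools as it

water = [0, 0.1, 0.2, 0, 0, 0.5, 0, 0, 0, 0.9, 1.0]*2 + [0]

# generator of (index, value) pairs, shared by the outer loop and takewhile
i_water = ((i, e) for i, e in enumerate(water))

# each sublist starts at a nonzero sample and runs until the next zero
chunk_i_water = [[i] + [e for e in it.takewhile(lambda x: x[1] != 0, i_water)]
                 for i in i_water if i[1] != 0]

print('chunked: ', *chunk_i_water, sep='\n')
print('\n')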
chunked:
[(1, 0.1), (2, 0.2)]
[(5, 0.5)]
[(9, 0.9), (10, 1.0)]
[(12, 0.1), (13, 0.2)]
[(16, 0.5)]
[(20, 0.9), (21, 1.0)]
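If the shared-generator trick with takewhile is hard to follow, the same runs can probably be built with itertools.groupby instead. This is only a minimal sketch on a shortened list; the name runs is mine and not part of the code above.

```python
from itertools import groupby

water = [0, 0.1, 0.2, 0, 0, 0.5, 0, 0, 0, 0.9, 1.0]

# Group consecutive (index, value) pairs by whether the value is nonzero,
# and keep only the nonzero groups.
runs = [list(group)
        for nonzero, group in groupby(enumerate(water), key=lambda pair: pair[1] != 0)
        if nonzero]

print(*runs, sep='\n')
```

Each element of runs has the same (index, value) layout as the sublists printed above, so it should be usable as input to the merging step.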
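Then merge the sublists based on the difference between their indices. Your application would use the index to look up the event times and test the difference in time instead of the difference in index.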
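def MergeOnDiff(a, diff):
    # copy each sublist so the input list a is not modified in place
    b = [list(a[0])]
    for i in range(len(a)-1):
        # compare the first index of the next run with the last index of this run
        if a[i+1][0][0] - a[i][-1][0] < diff+1:
            b[-1] += a[i+1]
        else:
            b.append(list(a[i+1]))
    return b

diff = 3
b = MergeOnDiff(chunk_i_water, diff)
print('merged with diff = ', diff, *b, sep='\n')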
merged with diff =
3
[(1, 0.1), (2, 0.2), (5, 0.5)]
[(9, 0.9), (10, 1.0), (12, 0.1), (13, 0.2), (16, 0.5)]
[(20, 0.9), (21, 1.0)]
# change diff:
merged with diff =
2
[(1, 0.1), (2, 0.2)]
[(5, 0.5)]
[(9, 0.9), (10, 1.0), (12, 0.1), (13, 0.2)]
[(16, 0.5)]
[(20, 0.9), (21, 1.0)]
Getting each event's average and index range from the sublists is easy:
for e in b:
    ave = sum((d[1] for d in e)) / len(e)
    print('for irange' , e[0][0], 'to', e[-1][0], 'ave = ', ave )
for irange 1 to 2 ave = 0.15000000000000002
for irange 5 to 5 ave = 0.5
for irange 9 to 13 ave = 0.55
for irange 16 to 16 ave = 0.5
for irange 20 to 21 ave = 0.95
Again, you would use the index to look up the event start/stop times.
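As a rough illustration of that lookup, here is a sketch that converts the index range of each merged event into start/stop times and also sums the water per event, since the question asks for totals. It assumes one sample per second from a known start time; start, step, events and event_times are illustrative names of mine, and the event values are made up in the same (index, value) form as above.

```python
from datetime import datetime, timedelta

# Assumption: one sample per second, starting at this time stamp.
start = datetime(2017, 1, 1, 14, 0, 0)
step = timedelta(seconds=1)

# Merged events as lists of (index, value) pairs (illustrative values).
events = [[(1, 0.1), (2, 0.2)],
          [(9, 0.9), (10, 1.0), (12, 0.1), (13, 0.2)]]

def event_times(event):
    """Return (start time, stop time, total amount) for one merged event."""
    first_index, last_index = event[0][0], event[-1][0]
    total = sum(value for _, value in event)
    return start + first_index * step, start + last_index * step, total

for event in events:
    begin, end, total = event_times(event)
    print(begin.time(), 'to', end.time(), 'total =', round(total, 3))
```

With real data you would more likely index into your own list of time stamps rather than compute them from a fixed step.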
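Answer 1 (score: 1)
Here is a solution that first labels each group of consecutive zeros. It then counts how many zeros are in each group and determines whether there are fewer than 5. Finally, it labels the entire water series, increasing the group number only when 5 or more consecutive zeros are found.
Once the groups are labeled correctly, the aggregation is easy.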
import pandas as pd

rng = pd.date_range('2017-01-01 14:00:00', '2017-01-01 14:01:00', freq='s')
water = [0,.2,.3,0,0,.4,0,0.3,0.2,0.5,0,0,0,0,0,1,3,4,0,0,0,0,5,4,0,0,2,4,0,0,
         0,0,0,1,0,0,0,0,0,1,.4,.3,.1,.4,0,0,0,4,5,0,1,0,0,0,0,0,5,1,2,0,0]
df = pd.DataFrame({'time_stamp':rng,'water_amount':water})

water = df.water_amount
# give each run of consecutive zeros its own label (0 where the amount is nonzero)
groups = water.ne(0).diff().fillna(0 == water.iloc[0]).cumsum().mul(water.eq(0))
# length of each zero run; nonzero samples (label 0) are given length 0
counts = groups.value_counts()
counts.loc[0] = 0
# True for nonzero samples and for zeros in a run shorter than 5
groups5 = groups.map(counts).lt(5)
# number the events; zero runs of 5 or more get label 0
groups_final = groups5.diff().cumsum().fillna(0).add(1).mul(groups5).astype(int)

df_agg = df.groupby(groups_final).agg({'time_stamp':['first', 'last'],
                                       'water_amount':'sum'}).drop(0)
df_agg.index.set_names(['Group Number'], inplace=True)
print(df_agg)
                      time_stamp                     water_amount
                           first                last          sum
Group Number
1            2017-01-01 14:00:00 2017-01-01 14:00:09          1.9
3            2017-01-01 14:00:15 2017-01-01 14:00:27         23.0
5            2017-01-01 14:00:33 2017-01-01 14:00:33          1.0
7            2017-01-01 14:00:39 2017-01-01 14:00:50         12.2
9            2017-01-01 14:00:56 2017-01-01 14:01:00          8.0
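A different way to get a similar grouping, which is not the method used above, is to drop the zero rows and start a new event whenever the time gap between consecutive nonzero samples is too large. The sketch below assumes one sample per second, so fewer than 5 zeros in between means a gap of at most 5 seconds; the data and the names max_gap, flow, event_number and summary are illustrative. Unlike the code above, events here start and end on a nonzero sample.

```python
import pandas as pd

# Small illustrative data set, sampled once per second.
start = pd.Timestamp('2017-01-01 14:00:00')
water = [0, 0.2, 0.3, 0, 0, 0.4, 0, 0, 0, 0, 0, 1.0, 3.0, 0]
df = pd.DataFrame({
    'time_stamp': start + pd.to_timedelta(list(range(len(water))), unit='s'),
    'water_amount': water,
})

# Assumption: with 1 s sampling, fewer than 5 zeros means a gap of at most 5 s.
max_gap = pd.Timedelta(seconds=5)

# Keep only the rows where water is flowing.
flow = df[df['water_amount'] > 0]

# A new event starts where the gap to the previous nonzero sample exceeds max_gap.
# The first row has no previous sample (NaT), which compares as False.
new_event = flow['time_stamp'].diff() > max_gap
event_number = (new_event.cumsum() + 1).rename('event_number')

summary = flow.groupby(event_number).agg(
    first=('time_stamp', 'first'),
    last=('time_stamp', 'last'),
    total=('water_amount', 'sum'),
)
print(summary)
```

Because the test is on time stamps rather than on a count of zeros, this should also tolerate irregular sampling, as long as max_gap is chosen accordingly.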