I have a DataFrame that looks like this:
TS TAGNAME VALUE
1 2019-11-25 09:00:00.000 TAG1 TRUE
2 2019-11-25 09:01:00.000 TAG2 TRUE
3 2019-11-25 09:02:00.000 TAG1 FALSE
4 2019-11-25 09:03:00.000 TAG2 FALSE
5 2019-11-25 09:04:00.000 TAG1 TRUE
6 2019-11-25 09:05:00.000 TAG1 FALSE
7 2019-11-25 09:06:00.000 TAG1 TRUE
8 2019-11-25 09:07:00.000 TAG2 TRUE
9 2019-11-25 09:08:00.000 TAG3 TRUE
10 2019-11-25 09:09:00.000 TAG1 FALSE
11 2019-11-25 09:10:00.000 TAG2 FALSE
12 2019-11-25 09:11:00.000 TAG3 FALSE
13 2019-11-25 09:12:00.000 TAG1 TRUE
14 2019-11-25 09:13:00.000 TAG1 TRUE
15 2019-11-25 09:14:00.000 TAG1 FALSE
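I want to add a column to this DataFrame containing the timestamp of the "next opposite" value for the same TAGNAME. For this example, I want the following:
TS TAGNAME VALUE PARTNER TS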
1 2019-11-25 09:00:00.000 TAG1 TRUE 2019-11-25 09:02:00.000 (from index 3)
2 2019-11-25 09:01:00.000 TAG2 TRUE 2019-11-25 09:03:00.000 (from index 4)
3 2019-11-25 09:02:00.000 TAG1 FALSE 2019-11-25 09:04:00.000 (from index 5)
4 2019-11-25 09:03:00.000 TAG2 FALSE 2019-11-25 09:07:00.000 (from index 8)
5 2019-11-25 09:04:00.000 TAG1 TRUE 2019-11-25 09:05:00.000 (from index 6)
6 2019-11-25 09:05:00.000 TAG1 FALSE 2019-11-25 09:06:00.000 (from index 7)
7 2019-11-25 09:06:00.000 TAG1 TRUE 2019-11-25 09:09:00.000 (from index 10)
8 2019-11-25 09:07:00.000 TAG2 TRUE 2019-11-25 09:10:00.000 (from index 11)
9 2019-11-25 09:08:00.000 TAG3 TRUE 2019-11-25 09:11:00.000 (from index 12)
10 2019-11-25 09:09:00.000 TAG1 FALSE 2019-11-25 09:12:00.000 (from index 13)
11 2019-11-25 09:10:00.000 TAG2 FALSE (EMPTY, no "next" found)
12 2019-11-25 09:11:00.000 TAG3 FALSE (EMPTY, no "next" found)
13 2019-11-25 09:12:00.000 TAG1 TRUE 2019-11-25 09:14:00.000 (from index 15)
14 2019-11-25 09:13:00.000 TAG1 TRUE 2019-11-25 09:14:00.000 (from index 15)
15 2019-11-25 09:14:00.000 TAG1 FALSE (EMPTY, no "next" found)
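Note that some rows may have no "partner" at all, and sometimes two or more consecutive TRUE or FALSE rows share a single opposite value.
The end goal is to compute the time interval between each value change for the same TAGNAME: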
TS TAGNAME VALUE PARTNER TS INTERVAL
1 2019-11-25 09:00:00.000 TAG1 TRUE 2019-11-25 09:02:00.000 (from index 3) 00:02:00
2 2019-11-25 09:01:00.000 TAG2 TRUE 2019-11-25 09:03:00.000 (from index 4) 00:02:00
3 2019-11-25 09:02:00.000 TAG1 FALSE 2019-11-25 09:04:00.000 (from index 5) 00:02:00
4 2019-11-25 09:03:00.000 TAG2 FALSE 2019-11-25 09:07:00.000 (from index 8) 00:04:00
5 2019-11-25 09:04:00.000 TAG1 TRUE 2019-11-25 09:05:00.000 (from index 6) 00:01:00
6 2019-11-25 09:05:00.000 TAG1 FALSE 2019-11-25 09:06:00.000 (from index 7) 00:01:00
7 2019-11-25 09:06:00.000 TAG1 TRUE 2019-11-25 09:09:00.000 (from index 10) 00:03:00
8 2019-11-25 09:07:00.000 TAG2 TRUE 2019-11-25 09:10:00.000 (from index 11) 00:03:00
9 2019-11-25 09:08:00.000 TAG3 TRUE 2019-11-25 09:11:00.000 (from index 12) 00:03:00
10 2019-11-25 09:09:00.000 TAG1 FALSE 2019-11-25 09:12:00.000 (from index 13) 00:03:00
11 2019-11-25 09:10:00.000 TAG2 FALSE (EMPTY, no "next" found) NaN
12 2019-11-25 09:11:00.000 TAG3 FALSE (EMPTY, no "next" found) NaN
13 2019-11-25 09:12:00.000 TAG1 TRUE 2019-11-25 09:14:00.000 (from index 15) 00:02:00
14 2019-11-25 09:13:00.000 TAG1 TRUE 2019-11-25 09:14:00.000 (from index 15) 00:01:00
15 2019-11-25 09:14:00.000 TAG1 FALSE (EMPTY, no "next" found) NaN
I suppose I could do this with a good old for loop, but that does not seem ideal and would probably be slow. Is there a way to do this in pandas?
Answer 0 (score: 1)
Here is my solution:
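For reference, the loop-based approach mentioned above could look roughly like the sketch below. It is meant only as a correctness baseline for the "next opposite value" rule. The function name `partner_ts_loop` is hypothetical, and the code assumes the rows are already sorted by TS and that TS is a datetime column.

```python
import pandas as pd

# Sample data from the question (index 1..15, one row per minute).
df = pd.DataFrame({
    'TS': pd.date_range('2019-11-25 09:00:00', periods=15, freq='min'),
    'TAGNAME': ['TAG1', 'TAG2', 'TAG1', 'TAG2', 'TAG1', 'TAG1', 'TAG1', 'TAG2',
                'TAG3', 'TAG1', 'TAG2', 'TAG3', 'TAG1', 'TAG1', 'TAG1'],
    'VALUE': [True, True, False, False, True, False, True, True, True, False,
              False, False, True, True, False],
}, index=range(1, 16))


def partner_ts_loop(frame):
    """Hypothetical helper: quadratic scan for the next opposite value per tag."""
    rows = list(frame[['TS', 'TAGNAME', 'VALUE']].itertuples(index=False, name=None))
    partners = []
    for i, (ts, tag, value) in enumerate(rows):
        partner = pd.NaT  # stays NaT when no later opposite value exists
        for later_ts, later_tag, later_value in rows[i + 1:]:
            if later_tag == tag and later_value != value:
                partner = later_ts
                break
        partners.append(partner)
    return pd.Series(partners, index=frame.index, name='PARTNER TS')


df['PARTNER TS'] = partner_ts_loop(df)
df['INTERVAL'] = df['PARTNER TS'] - df['TS']
```

This scans the remaining rows for every row, so the cost grows quadratically with the number of rows, which is why a vectorized alternative is worth looking for.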
df['diff'] = (df.assign(c=df.groupby(['TAGNAME','VALUE']).cumcount())
.sort_values(['c'])
.groupby('TAGNAME').TS.shift(-1)
.sub(df['TS'])
)
Output:
TS TAGNAME VALUE diff
1.0 2019-11-25 09:00:00 TAG1 True 00:02:00
2.0 2019-11-25 09:01:00 TAG2 True 00:02:00
3.0 2019-11-25 09:02:00 TAG1 False 00:02:00
4.0 2019-11-25 09:03:00 TAG2 False 00:04:00
5.0 2019-11-25 09:04:00 TAG1 True 00:01:00
6.0 2019-11-25 09:05:00 TAG1 False 00:01:00
7.0 2019-11-25 09:06:00 TAG1 True 00:03:00
8.0 2019-11-25 09:07:00 TAG2 True 00:03:00
9.0 2019-11-25 09:08:00 TAG3 True 00:03:00
10.0 2019-11-25 09:09:00 TAG1 False 00:03:00
11.0 2019-11-25 09:10:00 TAG2 False NaT
12.0 2019-11-25 09:11:00 TAG3 False NaT
13.0 2019-11-25 09:12:00 TAG1 True 00:02:00
14.0 2019-11-25 09:13:00 TAG1 True NaT
15.0 2019-11-25 09:14:00 TAG1 False -1 days +23:59:00
Explanation:
# create a new column with the order of occurrence of each VALUE within its TAGNAME
df.assign(c=df.groupby(['TAGNAME','VALUE']).cumcount())
# sort the data by that order of occurrence
.sort_values(['c'])
# group by TAGNAME and shift to get the timestamp of the following row
.groupby('TAGNAME').TS.shift(-1)
# subtract the original timestamp
.sub(df['TS'])
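Rows 14 and 15 in the output above (NaT and a negative interval) differ from the result requested in the question, because pairing rows by occurrence count assumes the values strictly alternate. The sketch below is one possible way to handle repeated values; it is not the answer's method. It marks the first row of each run of identical values per tag and gives every row the timestamp at which the next run starts. The names `starts`, `start_ts`, and `result` are illustrative, and TS is assumed to be a datetime column.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'TS': pd.date_range('2019-11-25 09:00:00', periods=15, freq='min'),
    'TAGNAME': ['TAG1', 'TAG2', 'TAG1', 'TAG2', 'TAG1', 'TAG1', 'TAG1', 'TAG2',
                'TAG3', 'TAG1', 'TAG2', 'TAG3', 'TAG1', 'TAG1', 'TAG1'],
    'VALUE': [True, True, False, False, True, False, True, True, True, False,
              False, False, True, True, False],
})

out = df.sort_values('TS').copy()
tags = out['TAGNAME']

# Cast to int so the comparison with the shifted (float, NaN-padded) values is well defined.
v = out['VALUE'].astype(int)
# True where a row starts a new run of identical values within its tag.
starts = v.ne(v.groupby(tags).shift())

# Timestamp at run starts, NaT elsewhere.
start_ts = out['TS'].where(starts)
# Look one row ahead within each tag, then back-fill so that every row
# of a run sees the start of the next run.
next_start = start_ts.groupby(tags).shift(-1)
out['PARTNER TS'] = next_start.groupby(tags).bfill()
out['INTERVAL'] = out['PARTNER TS'] - out['TS']

result = out.sort_index()
```

With this sketch, both the 09:12 and 09:13 rows of TAG1 are paired with 09:14, and the last row of each tag stays NaT.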
Answer 1 (score: 1)
This problem can be solved well with pandas merges.
Suppose df is the DataFrame you showed, for example:
import pandas as pd

# ISO date format avoids day/month ambiguity; 'min' is the minute frequency alias
index = pd.date_range('2019-11-25 09:00:00.000', periods=15, freq='min')
df = pd.DataFrame()
df['TS'] = index
df['TAGNAME'] = ['TAG1', 'TAG2', 'TAG1', 'TAG2', 'TAG1', 'TAG1', 'TAG1', 'TAG2',
                 'TAG3', 'TAG1', 'TAG2', 'TAG3', 'TAG1', 'TAG1', 'TAG1']
df['VALUE'] = [True, True, False, False, True, False, True, True, True, False, False,
               False, True, True, False]
To turn it into the final DataFrame you showed, two merges are needed:
# self-merge on TAGNAME: pairs every row with every row of the same tag
merged = df.merge(df, on='TAGNAME', how='outer')
# keep only pairs where the value is opposite and the partner is later in time
merged = merged[(merged['VALUE_x'] != merged['VALUE_y']) &
                (merged['TS_x'] < merged['TS_y'])]
# for each row, take the earliest later timestamp with the opposite value
# (grouping by TS_x alone assumes timestamps are unique across all tags)
merged = merged.groupby('TS_x', as_index=False)['TS_y'].min()
# rename columns for the merge in the next step and to match the required output
merged = merged.rename(columns={'TS_x': 'TS', 'TS_y': 'PARTNER TS'})
# finally, attach the partner timestamps to the main DataFrame
df = df.merge(merged, on='TS', how='left')
# calculate the interval
df['INTERVAL'] = df['PARTNER TS'] - df['TS']
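The self-merge builds every pair of rows that share a tag, which can get large for tags with many rows. As a hedged alternative, and not part of the answer above, the same lookup could be sketched with `pd.merge_asof` and a forward search. The helper column `_key` and the names `left`, `right`, and `result` are illustrative; the right-hand side carries the flipped value so that each row only matches rows with the opposite VALUE.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'TS': pd.date_range('2019-11-25 09:00:00', periods=15, freq='min'),
    'TAGNAME': ['TAG1', 'TAG2', 'TAG1', 'TAG2', 'TAG1', 'TAG1', 'TAG1', 'TAG2',
                'TAG3', 'TAG1', 'TAG2', 'TAG3', 'TAG1', 'TAG1', 'TAG1'],
    'VALUE': [True, True, False, False, True, False, True, True, True, False,
              False, False, True, True, False],
})

# Left side: the rows we want partners for, keyed by their own value (as int).
left = df.assign(_key=df['VALUE'].astype(int)).sort_values('TS')

# Right side: candidate partners, keyed by the flipped value, so a row with
# VALUE True can only match rows whose original VALUE is False.
right = pd.DataFrame({
    'TS': df['TS'],
    'TAGNAME': df['TAGNAME'],
    '_key': (~df['VALUE']).astype(int),
    'PARTNER TS': df['TS'],
}).sort_values('TS')

# For each left row, find the first strictly later right row with the same
# TAGNAME and key; rows without such a match get NaT.
result = pd.merge_asof(
    left, right, on='TS', by=['TAGNAME', '_key'],
    direction='forward', allow_exact_matches=False,
).drop(columns='_key')
result['INTERVAL'] = result['PARTNER TS'] - result['TS']
```

Both inputs to `merge_asof` must be sorted by the `on` column, which is why each side is sorted by TS before the call.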