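I need to group together entries whose timestamps are at most X seconds apart, and then average the value for each group, per device. In the example below I have a DataFrame with this data, and I need to group, by device, the entries that are no more than 60 seconds apart.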
              Device            Timestamp  Value
0  30:8c:fb:a4:b9:8b  10/26/2015 22:50:15     34
1  30:8c:fb:a4:b9:8b  10/26/2015 22:50:46     34
2  c0:ee:fb:35:ec:cd  10/26/2015 22:50:50     33
3  c0:ee:fb:35:ec:cd  10/26/2015 22:50:51     32
4  30:8c:fb:a4:b9:8b  10/26/2015 22:51:15     34
5  30:8c:fb:a4:b9:8b  10/26/2015 22:51:47     32
6  c0:ee:fb:35:ec:cd  10/26/2015 22:52:38     38
7  30:8c:fb:a4:b9:8b  10/26/2015 22:54:46     34
This should be the resulting DataFrame:
              Device           First_seen            Last_seen  Average_value
0  30:8c:fb:a4:b9:8b  10/26/2015 22:50:15  10/26/2015 22:51:47           33.5
1  c0:ee:fb:35:ec:cd  10/26/2015 22:50:50  10/26/2015 22:50:51           32.5
2  c0:ee:fb:35:ec:cd  10/26/2015 22:52:38  10/26/2015 22:52:38           38
3  30:8c:fb:a4:b9:8b  10/26/2015 22:54:46  10/26/2015 22:54:46           34
I have been trying to use TimeGrouper, but I could not find a workable solution. Thanks a lot for your help.
Answer 0 (score: 1)
You can use
diffs = df.groupby(['Device'])['Timestamp'].diff()
# In [39]: diffs
# Out[39]:
# 0 NaT
# 1 00:00:31
# 2 NaT
# 3 00:00:01
# 4 00:00:29
# 5 00:00:32
# 6 00:01:47
# 7 00:02:59
# dtype: timedelta64[ns]
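to compute the difference between consecutive timestamps within each device group.
Note that this relies on the timestamps being in chronological order (at least within each Device group). If they are not, you can sort the rows by Timestamp first (e.g. df = df.sort_values('Timestamp')).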
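A minimal sketch of the sorting step, assuming a pandas version where sort_values is available (it replaces the older DataFrame.sort). The small unsorted frame and the device names 'a' and 'b' are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical unsorted sample; column names follow the question.
df = pd.DataFrame({
    'Device': ['a', 'b', 'a', 'b'],
    'Timestamp': pd.to_datetime([
        '2015-10-26 22:51:15', '2015-10-26 22:50:51',
        '2015-10-26 22:50:15', '2015-10-26 22:50:50']),
    'Value': [34, 32, 34, 33],
})

# Sort chronologically; mergesort is stable, so rows with equal
# timestamps keep their original relative order.
df = df.sort_values('Timestamp', kind='mergesort').reset_index(drop=True)

# With sorted rows, every per-device diff is non-negative.
diffs = df.groupby('Device')['Timestamp'].diff()
print(diffs)
```

Resetting the index is optional; it only keeps the row labels in step with the new order.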
Then create a boolean mask that marks where the diff exceeds 60 seconds:
df['gap'] = diffs > pd.Timedelta(seconds=60)
# In [42]: df['gap']
# Out[42]:
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# 7 True
# Name: gap, dtype: bool
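The first row of each device has no predecessor, so its diff is NaT. A small sketch with made-up diffs shows why those rows come out as False in the mask: comparing NaT against a Timedelta evaluates to False.

```python
import pandas as pd

# Made-up diffs: NaT stands for the first row of a device group.
diffs = pd.Series(
    [pd.NaT, pd.Timedelta(seconds=31), pd.Timedelta(seconds=107)],
    dtype='timedelta64[ns]',
)

# NaT > 60s is False, so a device's first row never counts as a gap.
gap = diffs > pd.Timedelta(seconds=60)
print(gap.tolist())  # [False, False, True]
```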
For each device, we can use cumsum to compute the cumulative sum of df['gap'].
df['group'] = df.groupby(['Device'])['gap'].cumsum()
# In [45]: df['group']
# Out[45]:
# 0 0
# 1 0
# 2 0
# 3 0
# 4 0
# 5 0
# 6 1
# 7 1
# Name: group, dtype: int64
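Since False is treated as 0 and True as 1, the cumulative sum increases by one at every gap. Within each device, all rows that belong to the same run therefore share the same group number.
Now we can group by the Device and group columns, and find the first and last Timestamp and the mean Value in each group: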
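As a small illustration of the numbering, assuming a made-up mask for a single device: each True starts a new run, and the running sum labels every row with the number of gaps seen so far.

```python
import pandas as pd

# Made-up mask for one device: True marks a row that follows a gap.
gap = pd.Series([False, False, True, False, True])

# The running count of gaps serves as a run label.
group = gap.astype(int).cumsum()
print(group.tolist())  # [0, 0, 1, 1, 2]
```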
result = df.groupby(['Device', 'group']).agg(
    {'Timestamp': ['first', 'last'], 'Value': 'mean'})
#                                    Timestamp                     Value
#                                        first                last  mean
# Device            group
# 30:8c:fb:a4:b9:8b 0     2015-10-26 22:50:15 2015-10-26 22:51:47  33.5
#                   1     2015-10-26 22:54:46 2015-10-26 22:54:46  34.0
# c0:ee:fb:35:ec:cd 0     2015-10-26 22:50:50 2015-10-26 22:50:51  32.5
#                   1     2015-10-26 22:52:38 2015-10-26 22:52:38  38.0
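The result above has a MultiIndex on both rows and columns. If a flat table with the column names from the question (First_seen, Last_seen, Average_value) is wanted, one option is named aggregation, sketched here under the assumption of pandas 0.25 or later. The final sort by First_seen is an assumption about the desired row order, inferred from the expected output in the question.

```python
import pandas as pd

a, b = '30:8c:fb:a4:b9:8b', 'c0:ee:fb:35:ec:cd'
df = pd.DataFrame({
    'Device': [a, a, b, b, a, a, b, a],
    'Timestamp': pd.to_datetime([
        '2015-10-26 22:50:15', '2015-10-26 22:50:46',
        '2015-10-26 22:50:50', '2015-10-26 22:50:51',
        '2015-10-26 22:51:15', '2015-10-26 22:51:47',
        '2015-10-26 22:52:38', '2015-10-26 22:54:46']),
    'Value': [34, 34, 33, 32, 34, 32, 38, 34],
})

# Same steps as in the answer.
diffs = df.groupby('Device')['Timestamp'].diff()
df['gap'] = diffs > pd.Timedelta(seconds=60)
df['group'] = df.groupby('Device')['gap'].cumsum()

# Named aggregation yields flat column names directly.
flat = (
    df.groupby(['Device', 'group'])
      .agg(First_seen=('Timestamp', 'first'),
           Last_seen=('Timestamp', 'last'),
           Average_value=('Value', 'mean'))
      .reset_index()
      .drop(columns='group')
      .sort_values('First_seen')   # assumed ordering, see lead-in
      .reset_index(drop=True)
)
print(flat)
```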
Putting it all together:
import pandas as pd
df = pd.DataFrame(
    {'Device': {0: '30:8c:fb:a4:b9:8b',
                1: '30:8c:fb:a4:b9:8b',
                2: 'c0:ee:fb:35:ec:cd',
                3: 'c0:ee:fb:35:ec:cd',
                4: '30:8c:fb:a4:b9:8b',
                5: '30:8c:fb:a4:b9:8b',
                6: 'c0:ee:fb:35:ec:cd',
                7: '30:8c:fb:a4:b9:8b'},
     'Timestamp': {0: pd.Timestamp('2015-10-26 22:50:15'),
                   1: pd.Timestamp('2015-10-26 22:50:46'),
                   2: pd.Timestamp('2015-10-26 22:50:50'),
                   3: pd.Timestamp('2015-10-26 22:50:51'),
                   4: pd.Timestamp('2015-10-26 22:51:15'),
                   5: pd.Timestamp('2015-10-26 22:51:47'),
                   6: pd.Timestamp('2015-10-26 22:52:38'),
                   7: pd.Timestamp('2015-10-26 22:54:46')},
     'Value': {0: 34, 1: 34, 2: 33, 3: 32, 4: 34, 5: 32, 6: 38, 7: 34}})
diffs = df.groupby(['Device'])['Timestamp'].diff()
df['gap'] = diffs > pd.Timedelta(seconds=60)
df['group'] = df.groupby(['Device'])['gap'].cumsum()
result = df.groupby(['Device', 'group']).agg({'Timestamp': ['first','last'], 'Value':'mean'})
print(result)
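Since the question asks for a threshold of X seconds in general, the same steps can be wrapped in a function. This is only a sketch: group_by_gap and max_gap_seconds are hypothetical names, not part of pandas, and the function sorts its input itself so it does not depend on the row order.

```python
import pandas as pd

def group_by_gap(df, max_gap_seconds=60):
    """Hypothetical helper: merge each device's rows into runs in which
    consecutive timestamps are at most max_gap_seconds apart."""
    # Work on a sorted copy so the caller's frame is left untouched.
    ordered = df.sort_values(['Device', 'Timestamp']).copy()
    diffs = ordered.groupby('Device')['Timestamp'].diff()
    ordered['gap'] = (diffs > pd.Timedelta(seconds=max_gap_seconds)).astype(int)
    ordered['group'] = ordered.groupby('Device')['gap'].cumsum()
    return (
        ordered.groupby(['Device', 'group'])
               .agg(First_seen=('Timestamp', 'first'),
                    Last_seen=('Timestamp', 'last'),
                    Average_value=('Value', 'mean'))
               .reset_index()
    )

a, b = '30:8c:fb:a4:b9:8b', 'c0:ee:fb:35:ec:cd'
sample = pd.DataFrame({
    'Device': [a, a, b, b, a, a, b, a],
    'Timestamp': pd.to_datetime([
        '2015-10-26 22:50:15', '2015-10-26 22:50:46',
        '2015-10-26 22:50:50', '2015-10-26 22:50:51',
        '2015-10-26 22:51:15', '2015-10-26 22:51:47',
        '2015-10-26 22:52:38', '2015-10-26 22:54:46']),
    'Value': [34, 34, 33, 32, 34, 32, 38, 34],
})

runs_60 = group_by_gap(sample, max_gap_seconds=60)
runs_30 = group_by_gap(sample, max_gap_seconds=30)
print(runs_60)
print(len(runs_30))
```

With the 60-second threshold this gives the same four runs as above; a stricter 30-second threshold splits the first device's rows further.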