Group DataFrame entries that are within X seconds of each other

Time: 2015-11-30 15:30:52

Tags: python pandas dataframe

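I need to group together entries whose timestamps are at most X seconds apart, and then average the value of each such group, per device. In the example below I have a DataFrame with this data, and I need to group by Device the entries that are separated from one another by at most 60 seconds.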

              Device            Timestamp  Value
0  30:8c:fb:a4:b9:8b  10/26/2015 22:50:15     34
1  30:8c:fb:a4:b9:8b  10/26/2015 22:50:46     34
2  c0:ee:fb:35:ec:cd  10/26/2015 22:50:50     33
3  c0:ee:fb:35:ec:cd  10/26/2015 22:50:51     32
4  30:8c:fb:a4:b9:8b  10/26/2015 22:51:15     34
5  30:8c:fb:a4:b9:8b  10/26/2015 22:51:47     32
6  c0:ee:fb:35:ec:cd  10/26/2015 22:52:38     38
7  30:8c:fb:a4:b9:8b  10/26/2015 22:54:46     34

This should be the resulting DataFrame:

              Device           First_seen            Last_seen Average_value
0  30:8c:fb:a4:b9:8b  10/26/2015 22:50:15  10/26/2015 22:51:47          33,5
1  c0:ee:fb:35:ec:cd  10/26/2015 22:50:50  10/26/2015 22:50:51          32,5
2  c0:ee:fb:35:ec:cd  10/26/2015 22:52:38  10/26/2015 22:52:38            38
3  30:8c:fb:a4:b9:8b  10/26/2015 22:54:46  10/26/2015 22:54:46            34

I have been trying to use TimeGrouper, but I could not find a workable solution. Thank you very much for your help.

1 answer:

Answer 0 (score: 1)

You could use

diffs = df.groupby(['Device'])['Timestamp'].diff()
# In [39]: diffs
# Out[39]: 
# 0        NaT
# 1   00:00:31
# 2        NaT
# 3   00:00:01
# 4   00:00:29
# 5   00:00:32
# 6   00:01:47
# 7   00:02:59
# dtype: timedelta64[ns]

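to compute the difference between consecutive timestamps within each Device group. Note that this relies on the timestamps being in chronological order (at least within each Device group). If they are not, sort the rows by Timestamp first, e.g. df = df.sort_values('Timestamp'). (The older df.sort method has been removed from pandas.)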
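As a minimal sketch of that preparation step: the question shows timestamps as strings such as 10/26/2015 22:50:15, so the example below assumes they must first be parsed with pd.to_datetime and then sorted before diff() is meaningful. The frame df_raw and its single device 'a' are made up for illustration.

```python
import pandas as pd

# Hypothetical, deliberately out-of-order rows in the question's string format.
df_raw = pd.DataFrame({
    'Device': ['a', 'a', 'a'],
    'Timestamp': ['10/26/2015 22:51:15',
                  '10/26/2015 22:50:15',
                  '10/26/2015 22:50:46'],
})

# Parse the strings into real datetimes; diff() on strings would not work.
df_raw['Timestamp'] = pd.to_datetime(df_raw['Timestamp'],
                                     format='%m/%d/%Y %H:%M:%S')

# Sort chronologically so each row is compared with its true predecessor.
df_sorted = df_raw.sort_values('Timestamp').reset_index(drop=True)
diffs_sorted = df_sorted.groupby('Device')['Timestamp'].diff()
print(diffs_sorted)
```

The first row of each device has no predecessor, so its difference is NaT.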

Then create a boolean mask that marks where the difference exceeds 60 seconds:

df['gap'] = diffs > pd.Timedelta(seconds=60)
# In [42]: df['gap']
# Out[42]: 
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# 7     True
# Name: gap, dtype: bool

For each device, we can then use cumsum to compute the cumulative sum of df['gap']:

df['group'] = df.groupby(['Device'])['gap'].cumsum()
# In [45]: df['group']
# Out[45]: 
# 0    0
# 1    0
# 2    0
# 3    0
# 4    0
# 5    0
# 6    1
# 7    1
# Name: group, dtype: int64

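Since False is treated as 0 and True as 1, the cumulative sum increases by one at every gap, so within each Device group all rows that belong to the same run between gaps share the same number.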
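To see the labelling effect in isolation, here is a small sketch on a made-up mask for a single device (the Series gap below is illustrative and not taken from the question's data):

```python
import pandas as pd

# Illustrative mask: True marks a row that comes after a gap, i.e. starts a new run.
gap = pd.Series([False, False, True, False, True])

# Each True bumps the running total, so rows between gaps share one label.
labels = gap.cumsum()
print(labels.tolist())  # [0, 0, 1, 1, 2]
```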

Now we can group by the Device and group columns, and find the first and last Timestamp and the mean Value in each group:

result = df.groupby(['Device', 'group']).agg(
             {'Timestamp': ['first','last'], 'Value':'mean'})

#                                   Timestamp                     Value
#                                       first                last  mean
# Device            group                                              
# 30:8c:fb:a4:b9:8b 0     2015-10-26 22:50:15 2015-10-26 22:51:47  33.5
#                   1     2015-10-26 22:54:46 2015-10-26 22:54:46  34.0
# c0:ee:fb:35:ec:cd 0     2015-10-26 22:50:50 2015-10-26 22:50:51  32.5
#                   1     2015-10-26 22:52:38 2015-10-26 22:52:38  38.0
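The result above has a MultiIndex on both rows and columns, whereas the question asks for flat columns named First_seen, Last_seen and Average_value. One possible way to get that shape, assuming a pandas version that supports named aggregation (0.25 or later), is sketched below; the variable name flat and the final sort by First_seen are choices made here to mirror the row order in the question, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Device': ['30:8c:fb:a4:b9:8b', '30:8c:fb:a4:b9:8b', 'c0:ee:fb:35:ec:cd',
               'c0:ee:fb:35:ec:cd', '30:8c:fb:a4:b9:8b', '30:8c:fb:a4:b9:8b',
               'c0:ee:fb:35:ec:cd', '30:8c:fb:a4:b9:8b'],
    'Timestamp': pd.to_datetime([
        '2015-10-26 22:50:15', '2015-10-26 22:50:46', '2015-10-26 22:50:50',
        '2015-10-26 22:50:51', '2015-10-26 22:51:15', '2015-10-26 22:51:47',
        '2015-10-26 22:52:38', '2015-10-26 22:54:46']),
    'Value': [34, 34, 33, 32, 34, 32, 38, 34],
})

# Same three steps as in the answer: per-device diffs, gap mask, run labels.
diffs = df.groupby('Device')['Timestamp'].diff()
df['gap'] = diffs > pd.Timedelta(seconds=60)
df['group'] = df.groupby('Device')['gap'].cumsum()

# Named aggregation yields flat column names directly.
flat = (df.groupby(['Device', 'group'])
          .agg(First_seen=('Timestamp', 'first'),
               Last_seen=('Timestamp', 'last'),
               Average_value=('Value', 'mean'))
          .reset_index()
          .drop(columns='group')        # the run label is only a helper
          .sort_values('First_seen')    # order rows as in the question
          .reset_index(drop=True))
print(flat)
```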

Putting it all together:

import pandas as pd

df = pd.DataFrame(
    {'Device': {0: '30:8c:fb:a4:b9:8b',
                1: '30:8c:fb:a4:b9:8b',
                2: 'c0:ee:fb:35:ec:cd',
                3: 'c0:ee:fb:35:ec:cd',
                4: '30:8c:fb:a4:b9:8b',
                5: '30:8c:fb:a4:b9:8b',
                6: 'c0:ee:fb:35:ec:cd',
                7: '30:8c:fb:a4:b9:8b'},
     'Timestamp': {0: pd.Timestamp('2015-10-26 22:50:15'),
                   1: pd.Timestamp('2015-10-26 22:50:46'),
                   2: pd.Timestamp('2015-10-26 22:50:50'),
                   3: pd.Timestamp('2015-10-26 22:50:51'),
                   4: pd.Timestamp('2015-10-26 22:51:15'),
                   5: pd.Timestamp('2015-10-26 22:51:47'),
                   6: pd.Timestamp('2015-10-26 22:52:38'),
                   7: pd.Timestamp('2015-10-26 22:54:46')},
     'Value': {0: 34, 1: 34, 2: 33, 3: 32, 4: 34, 5: 32, 6: 38, 7: 34}})

diffs = df.groupby(['Device'])['Timestamp'].diff()
df['gap'] = diffs > pd.Timedelta(seconds=60)
df['group'] = df.groupby(['Device'])['gap'].cumsum()
result = df.groupby(['Device', 'group']).agg({'Timestamp': ['first','last'], 'Value':'mean'})
print(result)
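Since the question speaks of "X seconds" rather than a fixed 60, the same steps could be wrapped in a small helper that takes the threshold as a parameter. This is only a sketch: the function name label_runs, its parameters, and the sample frame are hypothetical and not part of the original answer. It also sorts internally, so it does not depend on the input order.

```python
import pandas as pd

def label_runs(df, max_gap, device_col='Device', time_col='Timestamp'):
    """Return a per-device run label; a new run starts when the gap exceeds max_gap."""
    # Sort so that diff() compares chronologically adjacent rows per device.
    ordered = df.sort_values([device_col, time_col])
    diffs = ordered.groupby(device_col)[time_col].diff()
    gap = diffs > max_gap  # NaT compares as False, so first rows start run 0
    labels = gap.groupby(ordered[device_col]).cumsum()
    # Restore the caller's original row order.
    return labels.reindex(df.index)

# Made-up sample with two devices.
sample = pd.DataFrame({
    'Device': ['a', 'a', 'a', 'b'],
    'Timestamp': pd.to_datetime(['2015-10-26 22:50:15', '2015-10-26 22:50:46',
                                 '2015-10-26 22:54:46', '2015-10-26 22:50:50']),
})

print(label_runs(sample, pd.Timedelta(seconds=60)).tolist())  # [0, 0, 1, 0]
print(label_runs(sample, pd.Timedelta(seconds=30)).tolist())  # [0, 1, 2, 0]
```

With a 30-second threshold the 31-second step between the first two rows of device a already counts as a gap, which is why it splits into three runs instead of two.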