Pandas: group by 2 columns, find the delta using another column

Time: 2019-02-14 15:01:12

Tags: python pandas pandas-groupby

I have a pandas DataFrame with 4909144 rows. It has time as the index and the columns source_name, dest_address, and tvalue, where tvalue holds the same values as the time index. I sorted df by source_name, dest_address, and tvalue with the following, so that the rows are grouped and ordered by time:

df = df.sort_values(by=['sourcehostname','destinationaddress','tvalue'])

Which gives me:

                        source_name  dest_address   tvalue
time
2019-02-06 15:00:54.000 source_1     72.21.215.90   2019-02-06 15:00:54.000
2019-02-06 15:01:00.000 source_1     72.21.215.90   2019-02-06 15:01:00.000
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000
2019-02-06 15:00:54.000 source_1     131.107.0.89   2019-02-06 15:00:54.000
2019-02-06 15:01:14.000 source_1     131.107.0.89   2019-02-06 15:01:14.000
2019-02-06 15:03:02.000 source_2     69.63.191.1    2019-02-06 15:03:02.000
2019-02-06 15:08:02.000 source_2     69.63.191.1    2019-02-06 15:08:02.000

I want the time difference between consecutive times, so I used:

#Create delta
df['delta'] = (df['tvalue']-df['tvalue'].shift()).fillna(0)

Which gives me:

                        source_name  dest_address   tvalue                   delta
time
2019-02-06 15:00:54.000 source_1     72.21.215.90   2019-02-06 15:00:54.000  00:00:00
2019-02-06 15:01:00.000 source_1     72.21.215.90   2019-02-06 15:01:00.000  00:00:06
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000  00:29:51
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000  00:00:00
2019-02-06 15:00:54.000 source_1     131.107.0.89   2019-02-06 15:00:54.000  -1 days +23:30:03
2019-02-06 15:01:14.000 source_1     131.107.0.89   2019-02-06 15:01:14.000  00:00:20
2019-02-06 15:03:02.000 source_2     69.63.191.1    2019-02-06 15:03:02.000  00:01:48
2019-02-06 15:08:02.000 source_2     69.63.191.1    2019-02-06 15:08:02.000  00:05:00

But I would like to group by source_name and dest_address and take the difference of tvalue within each group. That way the first row of a new group would not get a delta such as -1 days +23:30:00, or 00:01:48 where source_2 begins. Those values are computed against the last row of the previous group.

I am trying:

[code block not preserved]

But this takes a long time and may not give me the result I want.

The following does not work, but can it be done like my original code with a group-by added?

[code block not preserved]

1 Answer:

Answer 0: (score: 1)

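import pandas as pd

# A row starts a new group when either key differs from the previous row.
# The first row always counts as a change, because shift() yields NaN there.
source_changed = df['sourcehostname'] != df['sourcehostname'].shift()
dest_changed = df['destinationaddress'] != df['destinationaddress'].shift()
change_occurred = (source_changed | dest_changed)

# Row-to-row difference over the already sorted frame
time_diff = df['tvalue'].diff()

# A zero-length timedelta to use at group boundaries
zero_delta = pd.Timedelta(0)

df['time_diff'] = time_diff
df['change_occurred'] = change_occurred

# Where df['change_occurred'] is True, set df['delta'] to zero_delta;
# otherwise keep the value from df['time_diff'].
df['delta'] = df['time_diff'].mask(df['change_occurred'], zero_delta)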
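
For comparison, here is a minimal sketch of the group-by variant the question asks about. It is an illustration rather than part of the original answer. It assumes the column names shown in the printed tables (source_name, dest_address, tvalue) instead of the ones in the sort call, and it uses a small hand-built sample frame as a stand-in for the real data.

```python
import pandas as pd

# Hypothetical sample frame that mirrors the rows shown in the question
df = pd.DataFrame({
    "source_name": ["source_1"] * 6 + ["source_2"] * 2,
    "dest_address": ["72.21.215.90"] * 4
                    + ["131.107.0.89"] * 2
                    + ["69.63.191.1"] * 2,
    "tvalue": pd.to_datetime([
        "2019-02-06 15:00:54", "2019-02-06 15:01:00",
        "2019-02-06 15:30:51", "2019-02-06 15:30:51",
        "2019-02-06 15:00:54", "2019-02-06 15:01:14",
        "2019-02-06 15:03:02", "2019-02-06 15:08:02",
    ]),
})

keys = ["source_name", "dest_address"]

# Sort so that rows are in time order within each (source, destination) pair
df = df.sort_values(keys + ["tvalue"])

# diff() runs separately inside each group, so the first row of a group
# is NaT instead of a difference against the previous group's last row.
# fillna() then turns that NaT into a zero-length timedelta.
df["delta"] = df.groupby(keys)["tvalue"].diff().fillna(pd.Timedelta(0))

print(df)
```

Because the difference never crosses a group boundary, no negative delta can appear once tvalue is sorted within each group. This should give the same delta column as the boundary-mask approach above, with the grouping stated explicitly.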