I have a pandas DataFrame that looks like this:
df:
ts src_ip dst_ip dst_p srs_pkts src_bytes dst_pkts dst_bytes
0 2019-12-16 15:59:55.621609 185.156.73.60 x.x.x.x 46011 1 40 0 0
1 2019-12-16 19:59:58.368135 185.156.73.60 x.x.x.x 2010 1 40 0 0
2 2019-12-16 15:59:57.108674 185.176.27.182 y.y.y.y 10549 1 40 0 0
3 2019-12-16 22:00:00.774090 89.248.160.193 y.y.y.y 3084 1 40 0 0
4 2019-12-16 11:59:58.927001 89.248.162.161 z.z.z.z 6072 1 40 0 0
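What I want to achieve is the following:
1. Group the rows into one-hour intervals (using "ts").
2. Within each one-hour group, group by the columns (src_ip, dst_ip, dst_p).
3. Aggregate the remaining columns (src_pkts, src_bytes, dst_pkts, dst_bytes) by summing them.
4. Finally, combine these summaries into a new DataFrame like this: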
new_df:
first_seen_at last_seen_at src_ip dst_ip dst_p total_src_pkts total_src_bytes total_dst_pkts total_dst_bytes
2019-12-16 15:00:00.000000 2019-12-16 16:00:00.000000 185.156.73.60 x.x.x.x 46011 2 80 0 0 #summary of one hour for one tuple of (src_ip, dst_ip, dst_p)
.
.
.
Any help with the above would be much appreciated.
Answer 0 (score: 2)
With pandas 0.25+, try:
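import pandas as pd

df['ts'] = pd.to_datetime(df['ts'])  # pd.Grouper needs a datetime column

# Sum the counters per one-hour bucket and (src_ip, dst_ip, dst_p) tuple.
# Note: pandas 2.2+ deprecates the 'H' alias in favor of 'h'.
df_hour = df.groupby([pd.Grouper(freq='H', key='ts'), 'src_ip', 'dst_ip', 'dst_p']).sum()

# Regroup by the tuple alone; first/last give its earliest and latest hour bucket.
df_out = df_hour.reset_index('ts').groupby(level=[0, 1, 2])\
    .agg(first_seen_at=('ts', 'first'),
         last_seen_at=('ts', 'last'),
         total_src_pkts=('srs_pkts', 'sum'),
         total_src_bytes=('src_bytes', 'sum'),
         total_dst_pkts=('dst_pkts', 'sum'),
         total_dst_bytes=('dst_bytes', 'sum'))\
    .reset_index()
print(df_out)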
Output:
src_ip dst_ip dst_p first_seen_at last_seen_at total_src_pkts total_src_bytes total_dst_pkts total_dst_bytes
0 185.156.73.60 x.x.x.x 2010 2019-12-16 19:00:00 2019-12-16 19:00:00 1 40 0 0
1 185.156.73.60 x.x.x.x 46011 2019-12-16 15:00:00 2019-12-16 15:00:00 1 40 0 0
2 185.176.27.182 y.y.y.y 10549 2019-12-16 15:00:00 2019-12-16 15:00:00 1 40 0 0
3 89.248.160.193 y.y.y.y 3084 2019-12-16 22:00:00 2019-12-16 22:00:00 1 40 0 0
4 89.248.162.161 z.z.z.z 6072 2019-12-16 11:00:00 2019-12-16 11:00:00 1 40 0 0
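Note that the second groupby above regroups by (src_ip, dst_ip, dst_p) only, so a tuple seen in several different hours ends up as a single row spanning its first and last hour bucket. If one row per hour per tuple is wanted instead, as the sample new_df in the question seems to suggest, the following is a minimal sketch under that assumption. The helper name summarize_hourly is hypothetical, last_seen_at is assumed to mean the end of the one-hour window, and the srs_pkts spelling is kept from the question's frame.

```python
import pandas as pd

# Sample rows copied from the question; the masked addresses stay as plain strings.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2019-12-16 15:59:55.621609",
        "2019-12-16 19:59:58.368135",
        "2019-12-16 15:59:57.108674",
        "2019-12-16 22:00:00.774090",
        "2019-12-16 11:59:58.927001",
    ]),
    "src_ip": ["185.156.73.60", "185.156.73.60", "185.176.27.182",
               "89.248.160.193", "89.248.162.161"],
    "dst_ip": ["x.x.x.x", "x.x.x.x", "y.y.y.y", "y.y.y.y", "z.z.z.z"],
    "dst_p": [46011, 2010, 10549, 3084, 6072],
    "srs_pkts": [1, 1, 1, 1, 1],
    "src_bytes": [40, 40, 40, 40, 40],
    "dst_pkts": [0, 0, 0, 0, 0],
    "dst_bytes": [0, 0, 0, 0, 0],
})


def summarize_hourly(frame):
    """Return one summary row per (hour bucket, src_ip, dst_ip, dst_p)."""
    # Floor each timestamp to the start of its hour; this becomes first_seen_at.
    bucket = frame["ts"].dt.floor("h").rename("first_seen_at")
    out = (
        frame.groupby([bucket, "src_ip", "dst_ip", "dst_p"])
        .agg(
            total_src_pkts=("srs_pkts", "sum"),
            total_src_bytes=("src_bytes", "sum"),
            total_dst_pkts=("dst_pkts", "sum"),
            total_dst_bytes=("dst_bytes", "sum"),
        )
        .reset_index()
    )
    # Assumption: last_seen_at marks the end of the one-hour window.
    out["last_seen_at"] = out["first_seen_at"] + pd.Timedelta(hours=1)
    columns = [
        "first_seen_at", "last_seen_at", "src_ip", "dst_ip", "dst_p",
        "total_src_pkts", "total_src_bytes", "total_dst_pkts", "total_dst_bytes",
    ]
    return out[columns]


new_df = summarize_hourly(df)
print(new_df)
```

Because the hour bucket stays in the grouping key, the same tuple appearing at 15:xx and again at 19:xx produces two rows here, whereas the two-step version above would merge them into one.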