Question

I have a pandas DataFrame that looks like this:

    df:

    ts                          src_ip          dst_ip          dst_p   srs_pkts src_bytes  dst_pkts    dst_bytes
0   2019-12-16 15:59:55.621609  185.156.73.60   x.x.x.x         46011   1        40         0           0
1   2019-12-16 19:59:58.368135  185.156.73.60   x.x.x.x         2010    1        40         0           0
2   2019-12-16 15:59:57.108674  185.176.27.182  y.y.y.y         10549   1        40         0           0
3   2019-12-16 22:00:00.774090  89.248.160.193  y.y.y.y         3084    1        40         0           0
4   2019-12-16 11:59:58.927001  89.248.162.161  z.z.z.z         6072    1        40         0           0

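What I want to achieve is the following:

1. Group the rows into one-hour intervals (using "ts").

2. Within each one-hour group, group by the following columns: (src_ip, dst_ip, dst_p).

3. Aggregate by summing the remaining columns (src_pkts, src_bytes, dst_pkts, dst_bytes).

4. Finally, combine these summaries into a new DataFrame, as follows: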

new_df:

first_seen_at                 last_seen_at                  src_ip           dst_ip    dst_p    total_src_pkts    total_src_bytes    total_dst_pkts    total_dst_bytes
2019-12-16 15:00:00.000000    2019-12-16 16:00:00.000000    185.156.73.60    x.x.x.x   46011    2                 80                 0                 0               # summary of one hour for one tuple of (src_ip, dst_ip, dst_p)
.
.
.

Any help with the above would be greatly appreciated.

Answer 1

With pandas 0.25+, try:

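import pandas as pd

# Assumes df['ts'] already has a datetime dtype; if not, convert it with pd.to_datetime first.
# Step 1: sum the counters per one-hour bucket and (src_ip, dst_ip, dst_p) tuple.
df_hour = df.groupby([pd.Grouper(freq='H', key='ts'), 'src_ip', 'dst_ip', 'dst_p']).sum()
# Step 2: move 'ts' out of the index, then aggregate per (src_ip, dst_ip, dst_p),
# keeping the first and last hour bucket seen for each tuple.
# Note: 'srs_pkts' is the column name as spelled in the question's frame.
df_out = df_hour.reset_index('ts').groupby(level=[0, 1, 2])\
                .agg(first_seen_at=('ts', 'first'),
                     last_seen_at=('ts', 'last'),
                     total_src_pkts=('srs_pkts', 'sum'),
                     total_src_bytes=('src_bytes', 'sum'),
                     total_dst_pkts=('dst_pkts', 'sum'),
                     total_dst_bytes=('dst_bytes', 'sum'))\
                .reset_index()
print(df_out)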

Output:

           src_ip   dst_ip  dst_p       first_seen_at        last_seen_at  total_src_pkts  total_src_bytes  total_dst_pkts  total_dst_bytes
0   185.156.73.60  x.x.x.x   2010 2019-12-16 19:00:00 2019-12-16 19:00:00               1               40               0                0
1   185.156.73.60  x.x.x.x  46011 2019-12-16 15:00:00 2019-12-16 15:00:00               1               40               0                0
2  185.176.27.182  y.y.y.y  10549 2019-12-16 15:00:00 2019-12-16 15:00:00               1               40               0                0
3  89.248.160.193  y.y.y.y   3084 2019-12-16 22:00:00 2019-12-16 22:00:00               1               40               0                0
4  89.248.162.161  z.z.z.z   6072 2019-12-16 11:00:00 2019-12-16 11:00:00               1               40               0                0
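
Note that the second groupby above collapses every hour for a given (src_ip, dst_ip, dst_p) tuple into a single row. If one row per hour per tuple is wanted, as the question seems to describe, a minimal sketch could look like the following. It rests on a few assumptions: the sample rows are made up (the second row is hypothetical, added so that two packets fall in the same hour), the names `keys` and `hourly` are illustrative only, and `last_seen_at` is taken to mean the end of the hour window, i.e. `first_seen_at` plus one hour.

```python
import pandas as pd

# Made-up sample in the shape of the question's frame; the column name
# 'srs_pkts' follows the spelling used there.
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2019-12-16 15:59:55.621609",
        "2019-12-16 15:10:00.000000",  # hypothetical extra row, same hour and tuple
        "2019-12-16 19:59:58.368135",
    ]),
    "src_ip": ["185.156.73.60"] * 3,
    "dst_ip": ["x.x.x.x"] * 3,
    "dst_p": [46011, 46011, 2010],
    "srs_pkts": [1, 1, 1],
    "src_bytes": [40, 40, 40],
    "dst_pkts": [0, 0, 0],
    "dst_bytes": [0, 0, 0],
})

# '60min' is used for the hourly bucket because the spelling of the hourly
# alias ('H' vs 'h') differs between pandas versions.
keys = [pd.Grouper(key="ts", freq="60min"), "src_ip", "dst_ip", "dst_p"]

# One groupby: each (hour bucket, src_ip, dst_ip, dst_p) becomes one row.
hourly = (
    df.groupby(keys)
      .agg(total_src_pkts=("srs_pkts", "sum"),
           total_src_bytes=("src_bytes", "sum"),
           total_dst_pkts=("dst_pkts", "sum"),
           total_dst_bytes=("dst_bytes", "sum"))
      .reset_index()
      .rename(columns={"ts": "first_seen_at"})
)

# Assumption: the window ends exactly one hour after it starts.
hourly["last_seen_at"] = hourly["first_seen_at"] + pd.Timedelta(hours=1)

print(hourly)
```

With this sample, the two 15:xx rows for port 46011 are summed into one row and the 19:xx row stays separate. Whether `last_seen_at` should be the window end or the timestamp of the last packet actually seen in the window depends on what the summary is used for; the latter could be obtained by adding `('ts', 'max')` on a copy of the timestamp column instead.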

Summarizing a pandas DataFrame by date interval
