Elegant groupby and update in pandas?

Time: 2015-11-09 23:50:28

Tags: python pandas

I have the following pandas.DataFrame object:

       offset                      ts               op    time
0    0.000000 2015-10-27 18:31:40.318       Decompress   2.953
1    0.000000 2015-10-27 18:31:40.318  DeserializeBond   0.015
32   0.000000 2015-10-27 18:31:40.318         Compress  17.135
33   0.000000 2015-10-27 18:31:40.318       BuildIndex  19.494
34   0.000000 2015-10-27 18:31:40.318      InsertIndex   0.625
35   0.000000 2015-10-27 18:31:40.318         Compress  16.970
36   0.000000 2015-10-27 18:31:40.318       BuildIndex  18.954
37   0.000000 2015-10-27 18:31:40.318      InsertIndex   0.047
38   0.000000 2015-10-27 18:31:40.318         Compress  16.017
39   0.000000 2015-10-27 18:31:40.318       BuildIndex  17.814
40   0.000000 2015-10-27 18:31:40.318      InsertIndex   0.047
77   4.960683 2015-10-27 18:36:37.959       Decompress   2.844
78   4.960683 2015-10-27 18:36:37.959  DeserializeBond   0.000
108  4.960683 2015-10-27 18:36:37.959         Compress  17.758
109  4.960683 2015-10-27 18:36:37.959       BuildIndex  19.742
110  4.960683 2015-10-27 18:36:37.959      InsertIndex   0.110
111  4.960683 2015-10-27 18:36:37.959         Compress  16.267
112  4.960683 2015-10-27 18:36:37.959       BuildIndex  18.111
113  4.960683 2015-10-27 18:36:37.959      InsertIndex   0.062

I want to group by the (offset, ts, op) fields and sum the time values:

df = df.groupby(['offset', 'ts', 'op']).sum()

So far so good:

                                                    time
offset   ts                      op                     
0.000000 2015-10-27 18:31:40.318 BuildIndex       56.262
                                 Compress         50.122
                                 Decompress        2.953
                                 DeserializeBond   0.015
                                 InsertIndex       0.719
4.960683 2015-10-27 18:36:37.959 BuildIndex       37.853
                                 Compress         34.025
                                 Decompress        2.844
                                 DeserializeBond   0.000
                                 InsertIndex       0.172
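
To reproduce this step, here is a minimal sketch that rebuilds only the first timestamp group from the table above and applies the same groupby. The names raw and totals are placeholders that do not appear in the question, and the time column is selected explicitly so that nothing else is summed.

```python
import pandas as pd

# First timestamp group only, copied from the question's table.
ops = ["Decompress", "DeserializeBond",
       "Compress", "BuildIndex", "InsertIndex",
       "Compress", "BuildIndex", "InsertIndex",
       "Compress", "BuildIndex", "InsertIndex"]
times = [2.953, 0.015,
         17.135, 19.494, 0.625,
         16.970, 18.954, 0.047,
         16.017, 17.814, 0.047]

raw = pd.DataFrame({
    "offset": [0.0] * len(ops),
    "ts": pd.to_datetime(["2015-10-27 18:31:40.318"] * len(ops)),
    "op": ops,
    "time": times,
})

# One row per (offset, ts, op) combination, holding the summed time.
totals = raw.groupby(["offset", "ts", "op"])[["time"]].sum()
print(totals)
```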

The problem is that, in each group, I have to subtract the Compress time from the BuildIndex time. I was recommended to use DataFrame.xs(), and I came up with the following:

diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op")
diff['op'] = 'BuildIndex'
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
df.update(diff)

It does the job, but I strongly feel there must be a more elegant solution.

Can anyone suggest a better approach?

1 answer:

Answer 0: (score: 1)

Note: your line:

diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)

is unnecessary, because it leaves diff unchanged (the rows are already unique thanks to the earlier groupby).
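
A sketch of the question's approach with that line dropped, assuming df is the aggregated frame (rebuilt by hand here from the output shown in the question). Using set_index(..., append=True) is my substitution for the reset_index/groupby round trip: it moves op back into the index so that update can align on all three levels.

```python
import pandas as pd

# Aggregated frame rebuilt from the output shown in the question.
ops = ["BuildIndex", "Compress", "Decompress", "DeserializeBond", "InsertIndex"]
groups = [(0.0, pd.Timestamp("2015-10-27 18:31:40.318")),
          (4.960683, pd.Timestamp("2015-10-27 18:36:37.959"))]
index = pd.MultiIndex.from_tuples(
    [(off, ts, op) for off, ts in groups for op in ops],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame(
    {"time": [56.262, 50.122, 2.953, 0.015, 0.719,
              37.853, 34.025, 2.844, 0.000, 0.172]},
    index=index,
)

# BuildIndex minus Compress, aligned on the remaining (offset, ts) levels.
diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op")

# Put "op" back as the third index level so the rows line up with df.
diff["op"] = "BuildIndex"
diff = diff.set_index("op", append=True)

# Overwrite only the BuildIndex rows; every other row is left untouched.
df.update(diff)
print(df)
```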

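A bit of a hack is to use drop_level=False and .values (so the index is ignored during the subtraction). This is a little cheeky, because it assumes every group has both a "BuildIndex" row and a "Compress" row, so it may not be safe.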

In [11]: diff = df1.xs("BuildIndex", level="op", drop_level=False) - df1.xs("Compress", level="op").values

In [12]: diff
Out[12]:
                                     time
offset     ts           op
2015-10-27 18:31:40.318 BuildIndex  6.140
           18:36:37.959 BuildIndex  3.828
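
One way to make the .values trick less fragile is to check that both slices cover the same (offset, ts) groups before subtracting. This is only a sketch: the guard is my addition, not part of the answer, and df is rebuilt by hand from the aggregated output in the question.

```python
import pandas as pd

# Aggregated frame rebuilt from the output shown in the question.
ops = ["BuildIndex", "Compress", "Decompress", "DeserializeBond", "InsertIndex"]
groups = [(0.0, pd.Timestamp("2015-10-27 18:31:40.318")),
          (4.960683, pd.Timestamp("2015-10-27 18:36:37.959"))]
index = pd.MultiIndex.from_tuples(
    [(off, ts, op) for off, ts in groups for op in ops],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame(
    {"time": [56.262, 50.122, 2.953, 0.015, 0.719,
              37.853, 34.025, 2.844, 0.000, 0.172]},
    index=index,
)

# Keep the "op" level on the left-hand side so the result retains it.
build = df.xs("BuildIndex", level="op", drop_level=False)
compress = df.xs("Compress", level="op")

# Guard (an addition for illustration): subtracting raw values is positional,
# so both slices must list the same groups in the same order.
if not build.index.droplevel("op").equals(compress.index):
    raise ValueError("BuildIndex and Compress rows do not cover the same groups")

diff = build - compress.values
print(diff)
```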

I'd be tempted to unstack here, since the data really is two-dimensional:

In [21]: res = df1.unstack("op")

In [22]: res
Out[22]:
                              time
op                      BuildIndex Compress Decompress DeserializeBond InsertIndex
offset     ts
2015-10-27 18:31:40.318     56.262   50.122      2.953           0.015       0.719
           18:36:37.959     37.853   34.025      2.844           0.000       0.172

It's not clear that there is any value in keeping the columns as a MultiIndex, so flatten them:

In [23]: res.columns = res.columns.get_level_values(1)

In [24]: res
Out[24]:
op                       BuildIndex  Compress  Decompress  DeserializeBond  InsertIndex
offset     ts
2015-10-27 18:31:40.318      56.262    50.122       2.953            0.015        0.719
           18:36:37.959      37.853    34.025       2.844            0.000        0.172
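
As an alternative to flattening the columns afterwards, selecting the time column before unstacking yields single-level columns directly. A small sketch, with df rebuilt by hand from the aggregated output in the question:

```python
import pandas as pd

# Aggregated frame rebuilt from the output shown in the question.
ops = ["BuildIndex", "Compress", "Decompress", "DeserializeBond", "InsertIndex"]
groups = [(0.0, pd.Timestamp("2015-10-27 18:31:40.318")),
          (4.960683, pd.Timestamp("2015-10-27 18:36:37.959"))]
index = pd.MultiIndex.from_tuples(
    [(off, ts, op) for off, ts in groups for op in ops],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame(
    {"time": [56.262, 50.122, 2.953, 0.015, 0.719,
              37.853, 34.025, 2.844, 0.000, 0.172]},
    index=index,
)

# Unstacking a Series (rather than a one-column DataFrame) gives plain columns.
res = df["time"].unstack("op")
print(res.columns.nlevels)
print(res)
```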

Then the subtraction is easier:

In [25]: res["BuildIndex"] - res["Compress"]
Out[25]:
offset      ts
2015-10-27  18:31:40.318    6.140
            18:36:37.959    3.828
dtype: float64

In [26]: res["BuildIndex"] = res["BuildIndex"] - res["Compress"]

I suspect this is the most elegant...
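
If the original long, three-level layout is needed again after the subtraction, the wide table can be stacked back. This round trip is a sketch under the assumption that no group is missing an op (this sample has no missing values); out is a placeholder name, and df is rebuilt by hand from the aggregated output in the question.

```python
import pandas as pd

# Aggregated frame rebuilt from the output shown in the question.
ops = ["BuildIndex", "Compress", "Decompress", "DeserializeBond", "InsertIndex"]
groups = [(0.0, pd.Timestamp("2015-10-27 18:31:40.318")),
          (4.960683, pd.Timestamp("2015-10-27 18:36:37.959"))]
index = pd.MultiIndex.from_tuples(
    [(off, ts, op) for off, ts in groups for op in ops],
    names=["offset", "ts", "op"],
)
df = pd.DataFrame(
    {"time": [56.262, 50.122, 2.953, 0.015, 0.719,
              37.853, 34.025, 2.844, 0.000, 0.172]},
    index=index,
)

# Wide layout: one row per (offset, ts), one column per op.
res = df["time"].unstack("op")

# Column arithmetic aligns on the row index automatically.
res["BuildIndex"] = res["BuildIndex"] - res["Compress"]

# Back to the long layout with "op" as the innermost index level.
out = res.stack().to_frame("time")
print(out)
```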