I have the following pandas.DataFrame object:
offset ts op time
0 0.000000 2015-10-27 18:31:40.318 Decompress 2.953
1 0.000000 2015-10-27 18:31:40.318 DeserializeBond 0.015
32 0.000000 2015-10-27 18:31:40.318 Compress 17.135
33 0.000000 2015-10-27 18:31:40.318 BuildIndex 19.494
34 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.625
35 0.000000 2015-10-27 18:31:40.318 Compress 16.970
36 0.000000 2015-10-27 18:31:40.318 BuildIndex 18.954
37 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.047
38 0.000000 2015-10-27 18:31:40.318 Compress 16.017
39 0.000000 2015-10-27 18:31:40.318 BuildIndex 17.814
40 0.000000 2015-10-27 18:31:40.318 InsertIndex 0.047
77 4.960683 2015-10-27 18:36:37.959 Decompress 2.844
78 4.960683 2015-10-27 18:36:37.959 DeserializeBond 0.000
108 4.960683 2015-10-27 18:36:37.959 Compress 17.758
109 4.960683 2015-10-27 18:36:37.959 BuildIndex 19.742
110 4.960683 2015-10-27 18:36:37.959 InsertIndex 0.110
111 4.960683 2015-10-27 18:36:37.959 Compress 16.267
112 4.960683 2015-10-27 18:36:37.959 BuildIndex 18.111
113 4.960683 2015-10-27 18:36:37.959 InsertIndex 0.062
I want to group by the (offset, ts, op) fields and sum the time values:
df = df.groupby(['offset', 'ts', 'op']).sum()
So far so good:
time
offset ts op
0.000000 2015-10-27 18:31:40.318 BuildIndex 56.262
Compress 50.122
Decompress 2.953
DeserializeBond 0.015
InsertIndex 0.719
4.960683 2015-10-27 18:36:37.959 BuildIndex 37.853
Compress 34.025
Decompress 2.844
DeserializeBond 0.000
InsertIndex 0.172
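To reproduce this, here is a minimal sketch that rebuilds a few of the rows above and applies the same groupby. The ts column is kept as a plain string, which is an assumption about the dtype, and the names raw and summed are only illustrative.

```python
import pandas as pd

# A subset of the rows shown above. ts is stored as a string here (assumption);
# a datetime64 column would group the same way.
rows = [
    (0.000000, "2015-10-27 18:31:40.318", "Decompress", 2.953),
    (0.000000, "2015-10-27 18:31:40.318", "Compress", 17.135),
    (0.000000, "2015-10-27 18:31:40.318", "BuildIndex", 19.494),
    (0.000000, "2015-10-27 18:31:40.318", "Compress", 16.970),
    (0.000000, "2015-10-27 18:31:40.318", "BuildIndex", 18.954),
    (4.960683, "2015-10-27 18:36:37.959", "Compress", 17.758),
    (4.960683, "2015-10-27 18:36:37.959", "BuildIndex", 19.742),
]
raw = pd.DataFrame(rows, columns=["offset", "ts", "op", "time"])

# One row per (offset, ts, op), with the time values summed within each group.
summed = raw.groupby(["offset", "ts", "op"]).sum()
print(summed)
```

The result has a three-level MultiIndex (offset, ts, op) and a single time column, which is the shape the rest of the question works with.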
The problem is that, within each group, I have to subtract the Compress time from the BuildIndex time. I was advised to use DataFrame.xs(), and I came up with the following:
diff = df.xs("BuildIndex", level="op") - df.xs("Compress", level="op")
diff['op'] = 'BuildIndex'
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
df.update(diff)
It does the job, but I have a strong feeling that there must be a more elegant solution.
Can anyone suggest a better way?
Answer 0 (score: 1)
Note: your line:
diff = diff.reset_index().groupby(['offset', 'ts', 'op']).agg(lambda val: val)
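is unnecessary, since it leaves diff unchanged (its index is already unique after the earlier groupby).
A bit of a hack is to use drop_level=False together with .values (so the index is ignored when subtracting). This is a little cheeky, since it assumes that every group has exactly one "BuildIndex" row and one "Compress" row, which may not be safe: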
In [11]: diff = df1.xs("BuildIndex", level="op", drop_level=False) - df1.xs("Compress", level="op").values
In [12]: diff
Out[12]:
time
offset ts op
2015-10-27 18:31:40.318 BuildIndex 6.140
18:36:37.959 BuildIndex 3.828
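If that assumption is a concern, one alternative is to let pandas align the two cross-sections on (offset, ts) instead of bypassing the index with .values. The following is only a sketch under stated assumptions: the frame summed stands in for the grouped result, and the third group (offset 9.5, with no Compress row) is hypothetical and added purely to show what happens when a row is missing.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        (0.0, "18:31:40.318", "BuildIndex"),
        (0.0, "18:31:40.318", "Compress"),
        (4.960683, "18:36:37.959", "BuildIndex"),
        (4.960683, "18:36:37.959", "Compress"),
        (9.5, "18:41:00.000", "BuildIndex"),  # hypothetical group without Compress
    ],
    names=["offset", "ts", "op"],
)
summed = pd.DataFrame({"time": [56.262, 50.122, 37.853, 34.025, 10.0]}, index=idx)

build = summed.xs("BuildIndex", level="op")
comp = summed.xs("Compress", level="op")

# Subtraction aligns on (offset, ts); groups missing either row become NaN
# and are dropped, so they are left untouched by the update below.
diff = (
    (build - comp)
    .dropna()
    .assign(op="BuildIndex")
    .set_index("op", append=True)
)

# update() modifies summed in place, matching rows by the full (offset, ts, op) index.
summed.update(diff)
print(summed)
```

The trade-off is a couple of extra steps in exchange for not relying on the two cross-sections having the same length and row order.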
I'm tempted to unstack here, since the data is really two-dimensional:
In [21]: res = df1.unstack("op")
In [22]: res
Out[22]:
time
op BuildIndex Compress Decompress DeserializeBond InsertIndex
offset ts
2015-10-27 18:31:40.318 56.262 50.122 2.953 0.015 0.719
18:36:37.959 37.853 34.025 2.844 0.000 0.172
It's not clear that there is any value in keeping these as MultiIndex columns, so flatten them:
In [23]: res.columns = res.columns.get_level_values(1)
In [24]: res
Out[24]:
op BuildIndex Compress Decompress DeserializeBond InsertIndex
offset ts
2015-10-27 18:31:40.318 56.262 50.122 2.953 0.015 0.719
18:36:37.959 37.853 34.025 2.844 0.000 0.172
Then the subtraction is easier:
In [25]: res["BuildIndex"] - res["Compress"]
Out[25]:
offset ts
2015-10-27 18:31:40.318 6.140
18:36:37.959 3.828
dtype: float64
In [26]: res["BuildIndex"] = res["BuildIndex"] - res["Compress"]
I suspect this is the most elegant approach...
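Putting the unstack steps together, a self-contained sketch might look like the following. It assumes a grouped frame like the one in the question (the name summed and the shortened ts strings are illustrative), selects the time column before unstacking so that the columns come out flat, and stacks back at the end in case the original long shape is needed.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [
        (0.0, "18:31:40.318", "BuildIndex"),
        (0.0, "18:31:40.318", "Compress"),
        (0.0, "18:31:40.318", "InsertIndex"),
        (4.960683, "18:36:37.959", "BuildIndex"),
        (4.960683, "18:36:37.959", "Compress"),
        (4.960683, "18:36:37.959", "InsertIndex"),
    ],
    names=["offset", "ts", "op"],
)
summed = pd.DataFrame(
    {"time": [56.262, 50.122, 0.719, 37.853, 34.025, 0.172]}, index=idx
)

# Unstacking the time Series (rather than the frame) yields one plain column
# per op, so no MultiIndex columns need to be flattened afterwards.
wide = summed["time"].unstack("op")

# Column-wise subtraction; rows are matched by (offset, ts) automatically.
wide["BuildIndex"] = wide["BuildIndex"] - wide["Compress"]

# Optional: return to the long (offset, ts, op) layout.
long_again = wide.stack().to_frame("time")
print(long_again)
```

Whether to stack back depends on what consumes the result; the wide layout is often easier to work with for further per-op arithmetic.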