I have the following DataFrame:
In[45]: data[:10]
Out[45]:
Z A beta2 M shell
0 100 200 0.3112 197.2 -4.213
1 100 200 -0.4197 202 -1.143
2 100 200 0.03205 203 0
3 100 201 0.2967 191 -4.434
4 100 201 -0.4893 196.1 -4.691
5 100 202 0.3084 183.4 -4.134
6 100 202 -0.4873 188.2 -4.75
7 100 202 -0.2483 188.4 -1.106
8 100 203 0.3069 177.1 -4.355
9 101 203 -0.4956 182.5 -5.217
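My question is: how can I group/transform the data so that (Z, A) becomes a MultiIndex (or a pair of index levels), given that the (Z, A) pairs are not unique? To make my goal clear, this is the result I am hoping for: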
beta2[1] beta2[2] beta2[3] M[1] M[2] M[3] shell[1] shell[2] shell[3]
Z A
0 100 200 0.3112 -0.4197 0.03205 197.2 202 203 -4.213 -1.143 0
1 100 201 0.2967 0.4893 NaN 191 196.1 NaN -4.434 -4.691 NaN
2 100 202 0.3084 -0.4873 NaN 183.4 188.2 NaN -4.134 -4.75 NaN
3 100 203 0.3069 NaN NaN 177.1 NaN NaN -4.355 NaN NaN
4 101 203 -0.4956 NaN NaN 182.5 NaN NaN -5.217 NaN NaN
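As far as I can tell, this involves at least two steps: one to deal with the non-uniqueness and one to index on Z and A, so help with either step would be appreciated. Is there some other data structure that would suit this problem better?
Edit: I found that this line:
data = data.set_index(['Z', 'A'])
solves the problem of indexing on Z, A. Unfortunately, it only works if the (Z, A) pairs are unique.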
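A quick way to check whether the (Z, A) pairs are unique is sketched below. It uses three rows taken from the data above; `is_unique` and `duplicated` are standard pandas calls, and the variable names are only illustrative.

```python
import pandas as pd

# Three rows from the example data: two share (Z, A) = (100, 200).
df = pd.DataFrame({
    'Z': [100, 100, 101],
    'A': [200, 200, 203],
    'M': [197.2, 202.0, 182.5],
})

indexed = df.set_index(['Z', 'A'])

# False here, because the pair (100, 200) appears twice.
print(indexed.index.is_unique)

# Mark every row whose (Z, A) pair occurs more than once.
dup = df.duplicated(['Z', 'A'], keep=False)
print(df[dup])
```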
Answer 0 (score: 6)
I have an open issue that addresses this kind of problem:
https://github.com/pydata/pandas/issues/388
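Here is a solution. First, a simple (and not very efficient) function to get each row's ordinal within its group:
from collections import defaultdict

import numpy as np


def group_position(*args):
    """
    Return, for each row, its 0-based position within its group.

    Each argument is a sequence of key values; rows sharing the same
    combination of keys belong to the same group.
    """
    table = defaultdict(int)  # next ordinal for each key tuple
    result = []
    for tup in zip(*args):
        result.append(table[tup])
        table[tup] += 1
    return np.array(result)
For example: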
In [49]: group_position(df['Z'], df['A'])
Out[49]: array([0, 1, 2, 0, 1, 0, 1, 2, 0, 0])
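In more recent pandas versions, the same ordinals can probably be obtained without a Python loop by using `groupby(...).cumcount()`. The following is a minimal sketch, assuming a pandas version that provides `cumcount`; the frame reproduces only the `Z` and `A` columns of the example data.

```python
import pandas as pd

# Only the key columns from the example data are needed here.
df = pd.DataFrame({
    'Z': [100] * 9 + [101],
    'A': [200, 200, 200, 201, 201, 202, 202, 202, 203, 203],
})

# cumcount numbers the rows within each (Z, A) group, starting at 0,
# in the order in which the rows appear.
pos = df.groupby(['Z', 'A']).cumcount()
print(pos.tolist())
```

This should give the same ordinals as `group_position(df['Z'], df['A'])`, returned as a Series aligned with `df` rather than as a NumPy array.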
Now use it as an auxiliary index variable and unstack:
In [52]: df
Out[52]:
Z A beta2 M shell
0 100 200 0.31120 197.2 -4.213
1 100 200 -0.41970 202.0 -1.143
2 100 200 0.03205 203.0 0.000
3 100 201 0.29670 191.0 -4.434
4 100 201 -0.48930 196.1 -4.691
5 100 202 0.30840 183.4 -4.134
6 100 202 -0.48730 188.2 -4.750
7 100 202 -0.24830 188.4 -1.106
8 100 203 0.30690 177.1 -4.355
9 101 203 -0.49560 182.5 -5.217
In [53]: df['pos'] = group_position(df['Z'], df['A'])
In [54]: df.set_index(['Z', 'A', 'pos']).unstack('pos')
Out[54]:
          beta2                        M                  shell
pos           0       1        2      0      1      2       0      1      2
Z   A
100 200  0.3112 -0.4197  0.03205  197.2  202.0  203.0  -4.213 -1.143  0.000
    201  0.2967 -0.4893      NaN  191.0  196.1    NaN  -4.434 -4.691    NaN
    202  0.3084 -0.4873 -0.24830  183.4  188.2  188.4  -4.134 -4.750 -1.106
    203  0.3069     NaN      NaN  177.1    NaN    NaN  -4.355    NaN    NaN
101 203 -0.4956     NaN      NaN  182.5    NaN    NaN  -5.217    NaN    NaN
Finally, reshape it into the form you showed:
In [61]: result = df.set_index(['Z', 'A', 'pos']).unstack('pos')
In [62]: result.rename(columns=lambda x: '%s[%d]' % (x[0], x[1]+1)).reset_index()
Out[62]:
Z A beta2[1] beta2[2] beta2[3] M[1] M[2] M[3] shell[1] shell[2] shell[3]
0 100 200 0.3112 -0.4197 0.03205 197.2 202.0 203.0 -4.213 -1.143 0.000
1 100 201 0.2967 -0.4893 NaN 191.0 196.1 NaN -4.434 -4.691 NaN
2 100 202 0.3084 -0.4873 -0.24830 183.4 188.2 188.4 -4.134 -4.750 -1.106
3 100 203 0.3069 NaN NaN 177.1 NaN NaN -4.355 NaN NaN
4 101 203 -0.4956 NaN NaN 182.5 NaN NaN -5.217 NaN NaN
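One caveat: in newer pandas versions, `rename` with a function on MultiIndex columns may be applied to each level label separately rather than to the whole `(name, pos)` tuple, in which case the `rename` call above would fail. A hedged alternative is to rebuild the column labels directly. The sketch below runs the whole pipeline on a reduced version of the data (four rows, two value columns) and assumes `cumcount` is available; `wide` is just an illustrative name.

```python
import pandas as pd

# Reduced version of the example data: four rows, two value columns.
df = pd.DataFrame({
    'Z': [100, 100, 100, 101],
    'A': [200, 200, 201, 203],
    'beta2': [0.3112, -0.4197, 0.2967, -0.4956],
    'M': [197.2, 202.0, 191.0, 182.5],
})

# Ordinal of each row within its (Z, A) group.
df['pos'] = df.groupby(['Z', 'A']).cumcount()

# Move the ordinal into the columns: one column per (name, pos) pair.
wide = df.set_index(['Z', 'A', 'pos']).unstack('pos')

# Each column label is a (name, pos) tuple; flatten it to 'name[pos+1]'.
wide.columns = ['%s[%d]' % (name, pos + 1) for name, pos in wide.columns]

# Turn Z and A back into ordinary columns.
wide = wide.reset_index()
print(wide)
```

Groups with fewer rows than the largest group are padded with NaN, as in the output above.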