I have a pandas DataFrame similar to the one below, which holds groups of data
identified by the column id:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']
print(df)
A B C D id
0 0.347501 -1.152416 1.441144 -0.144545 w
1 0.775828 -1.176764 0.203049 -0.305332 w
2 1.036246 -0.467927 0.088138 -0.438207 w
3 -0.737092 -0.231706 0.268403 0.464026 x
4 -1.857346 -1.420284 -0.515517 -0.231774 x
5 -0.970731 0.217890 0.193814 -0.078838 y
6 -0.318314 -0.244348 0.162103 1.204386 y
7 0.340199 1.074977 1.201068 -0.431473 y
8 0.202050 0.790434 0.643458 -0.068620 z
9 -0.882865 0.687325 -0.008771 -0.066912 z
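Now I want to create new DataFrames (named df_w, df_x, df_y, df_z), each holding only the rows of the original DataFrame that belong to one id, ideally collected in some iterable such as a list: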
df_w
A B C D id
0 0.347501 -1.152416 1.441144 -0.144545 w
1 0.775828 -1.176764 0.203049 -0.305332 w
2 1.036246 -0.467927 0.088138 -0.438207 w
df_x
A B C D id
0 -0.737092 -0.231706 0.268403 0.464026 x
1 -1.857346 -1.420284 -0.515517 -0.231774 x
df_y
A B C D id
0 -0.970731 0.217890 0.193814 -0.078838 y
1 -0.318314 -0.244348 0.162103 1.204386 y
2 0.340199 1.074977 1.201068 -0.431473 y
df_z
A B C D id
0 0.202050 0.790434 0.643458 -0.068620 z
1 -0.882865 0.687325 -0.008771 -0.066912 z
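Is there a smart (vectorized pandas) way to achieve this using groupby, apply and/or applymap together with functions?
I was thinking of iterating over the DataFrame, but that does not seem very elegant..
Thanks in advance for any hints!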
Answer 0 (score: 6)
We can create a dictionary of DataFrames:
In [166]: dfs = {k:v for k,v in df.groupby('id')}
In [168]: dfs.keys()
Out[168]: dict_keys(['W', 'Y', 'Z'])
In [169]: dfs['W']
Out[169]:
A B C D id
0 -0.373021 -0.555218 0.022980 -0.512323 W
1 -1.599466 0.637292 0.045059 -0.334030 W
2 0.100659 0.557068 0.142226 -0.186214 W
In [170]: dfs['Y']
Out[170]:
A B C D id
5 0.540107 -0.739077 0.992408 2.010203 Y
6 -0.201376 -0.913222 -0.173284 1.837442 Y
7 -1.367659 0.915360 0.072720 -0.886071 Y
In [171]: dfs['Z']
Out[171]:
A B C D id
3 -0.329087 0.842431 0.839319 -0.597823 Z
4 -0.594375 -0.950486 1.125584 0.116599 Z
8 0.366667 -0.978279 -1.449893 0.192451 Z
9 -0.007439 -0.084612 0.010192 -0.417602 Z
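Since the question asked for an iterable such as a list, here is a minimal sketch of how the dictionary above might be consumed. The seeded generator `rng` and the name `df_list` are assumptions added for illustration and are not part of the original session.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption, used only to make the sketch reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

# One sub-frame per id value; groupby sorts the group keys by default.
dfs = {k: v for k, v in df.groupby('id')}

# A plain list of the sub-frames, in the same (sorted) key order.
df_list = list(dfs.values())

# Iterate over the key/sub-frame pairs without creating separate variables.
for key, group in dfs.items():
    print(key, group.shape)
```

Each sub-frame keeps the row labels it had in the original DataFrame, which is why the 'Z' group shows labels 3, 4, 8 and 9.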
Update, with the index reset:
In [177]: {k:v.reset_index(drop=True) for k,v in df.groupby('id')}
Out[177]:
{'W': A B C D id
0 -0.373021 -0.555218 0.022980 -0.512323 W
1 -1.599466 0.637292 0.045059 -0.334030 W
2 0.100659 0.557068 0.142226 -0.186214 W,
'Y': A B C D id
0 0.540107 -0.739077 0.992408 2.010203 Y
1 -0.201376 -0.913222 -0.173284 1.837442 Y
2 -1.367659 0.915360 0.072720 -0.886071 Y,
'Z': A B C D id
0 -0.329087 0.842431 0.839319 -0.597823 Z
1 -0.594375 -0.950486 1.125584 0.116599 Z
2 0.366667 -0.978279 -1.449893 0.192451 Z
3 -0.007439 -0.084612 0.010192 -0.417602 Z}
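To make the role of `drop=True` concrete, here is a small sketch comparing the two forms of `reset_index` on one group. The names `kept` and `dropped` and the seeded generator are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption, used only to make the sketch reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

z = df[df['id'] == 'Z']

# Without drop=True the old row labels are kept as a new 'index' column.
kept = z.reset_index()

# With drop=True the old labels are discarded and rows are renumbered from 0.
dropped = z.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.index.tolist())
```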
Answer 1 (score: 3)
I think the best approach is to create a dict by converting the groupby object to tuples and then to a dict:
# make the index within each group start from 0
df.index = df.groupby('id').cumcount()
dfs = dict(tuple(df.groupby('id')))
print (dfs)
{'W': A B C D id
0 1.331587 0.715279 -1.545400 -0.008384 W
1 0.621336 -0.720086 0.265512 0.108549 W
2 0.004291 -0.174600 0.433026 1.203037 W, 'Y': A B C D id
0 -1.977728 -1.743372 0.266070 2.384967 Y
1 1.123691 1.672622 0.099149 1.397996 Y
2 -0.271248 0.613204 -0.267317 -0.549309 Y, 'Z': A B C D id
0 -0.965066 1.028274 0.228630 0.445138 Z
1 -1.136602 0.135137 1.484537 -1.079805 Z
2 0.132708 -0.476142 1.308473 0.195013 Z
3 0.400210 -0.337632 1.256472 -0.731970 Z}
print (dfs['Y'])
A B C D id
0 -1.977728 -1.743372 0.266070 2.384967 Y
1 1.123691 1.672622 0.099149 1.397996 Y
2 -0.271248 0.613204 -0.267317 -0.549309 Y
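Note that assigning to `df.index` changes the original DataFrame in place. A minimal sketch of the same idea on a copy, assuming the original labels should be preserved, might look like this (the names `counter` and `renumbered` and the seeded generator are hypothetical):

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption, used only to make the sketch reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

# cumcount numbers the rows of each group 0, 1, 2, ... in order of appearance.
counter = df.groupby('id').cumcount()

# Work on a copy so the labels of the original df are left untouched.
renumbered = df.copy()
renumbered.index = counter

# Iterating a groupby yields (key, sub-frame) pairs, which dict() accepts.
dfs = dict(tuple(renumbered.groupby('id')))

print(dfs['Z'])
```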
Interestingly, it is also possible to use globals to create custom DataFrame names, but a dict is better:
# use a separate loop variable (g) so the original df is not overwritten
for i, g in df.groupby('id'):
    globals()['df_' + i] = g.reset_index(drop=True)
print (df_Y)
A B C D id
0 -1.977728 -1.743372 0.266070 2.384967 Y
1 1.123691 1.672622 0.099149 1.397996 Y
2 -0.271248 0.613204 -0.267317 -0.549309 Y
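As a sketch of why a dict tends to be the better choice: the same `df_` names can be used as dictionary keys, which keeps the module namespace clean and makes it easy to loop over the results. The name `named` and the seeded generator are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption, used only to make the sketch reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 4)), columns=list('ABCD'))
df['id'] = ['W', 'W', 'W', 'Z', 'Z', 'Y', 'Y', 'Y', 'Z', 'Z']

# Same naming scheme as the globals() version, but stored as dict keys.
named = {'df_' + k: g.reset_index(drop=True) for k, g in df.groupby('id')}

# Lookup by name works like a variable, and the names can be enumerated.
print(named['df_Y'])
for name, frame in named.items():
    print(name, len(frame))
```

With this layout the set of sub-frames is known from `named.keys()`, whereas names injected through `globals()` have to be tracked separately.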