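I have a large DataFrame from which I get the data I need with groupby.
I now need to turn the index of the resulting DataFrame into several separate columns.
Part of the original DataFrame looks like this: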
code place vl year week
0 111.0002.0056 region1 1 2017 29
1 112.6500.2285 region2 1 2017 31
2 112.5600.6325 region2 1 2017 30
3 112.5600.6325 region2 1 2017 30
4 112.5600.8159 region2 1 2017 30
5 111.0002.0056 region2 1 2017 29
6 111.0002.0056 region2 1 2017 30
7 111.0002.0056 region2 1 2017 28
8 112.5600.8159 region3 1 2017 31
9 112.5600.8159 region3 1 2017 28
10 111.0002.0114 region3 1 2017 31
....
After applying groupby it looks like this (code: df_test1 = df_test.groupby(['code' , 'year', 'week', 'place'])['vl'].sum().unstack(fill_value=0)):
place region1 region2 region3 region4 index1
code year week
111.0002.0006 2017 29 0 3 0 0 (111.0002.0006, 2017, 29)
30 0 7 0 0 (111.0002.0006, 2017, 30)
111.0002.0018 2017 29 0 0 0 0 (111.0002.0018, 2017, 29)
111.0002.0029 2017 30 0 0 0 0 (111.0002.0029, 2017, 30)
111.0002.0055 2017 28 0 33 0 8 (111.0002.0055, 2017, 28)
29 1 155 2 41 (111.0002.0055, 2017, 29)
30 0 142 1 39 (111.0002.0055, 2017, 30)
31 0 31 0 13 (111.0002.0055, 2017, 31)
111.0002.0056 2017 28 9 36 0 4 (111.0002.0056, 2017, 28)
29 20 75 2 37 (111.0002.0056, 2017, 29)
30 17 81 2 33 (111.0002.0056, 2017, 30)
....
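I saved the index in a separate column, index1 (code: df_test1['index1'] = df_test1.index).
From the index1 column I need to get three separate columns: code, year and week.
The result should look like this: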
region1 region2 region3 region4 code year week
0 3 0 0 111.0002.0006 2017 29
0 7 0 0 111.0002.0006 2017 30
0 0 0 0 111.0002.0018 2017 29
0 0 0 0 111.0002.0029 2017 30
0 33 0 8 111.0002.0055 2017 28
1 155 2 41 111.0002.0055 2017 29
0 142 1 39 111.0002.0055 2017 30
0 31 0 13 111.0002.0055 2017 31
....
I would appreciate any suggestions!
Answer 0 (score: 5)
Use reset_index instead of df_test1['index1'] = df_test1.index. For a cleaner DataFrame, also add rename_axis to remove the columns name place:
df_test1 = df_test.groupby(['code' , 'year', 'week', 'place'])['vl'].sum() \
.unstack(fill_value=0) \
.reset_index() \
.rename_axis(None, axis=1)
print (df_test1)
code year week region1 region2 region3
0 111.0002.0056 2017 28 0 1 0
1 111.0002.0056 2017 29 1 1 0
2 111.0002.0056 2017 30 0 1 0
3 111.0002.0114 2017 31 0 0 1
4 112.5600.6325 2017 30 0 2 0
5 112.5600.8159 2017 28 0 0 1
6 112.5600.8159 2017 30 0 1 0
7 112.5600.8159 2017 31 0 0 1
8 112.6500.2285 2017 31 0 1 0
If necessary, change the column order at the end:
#all cols are columns in df_test1
cols = ['code' , 'year', 'week']
df_test1 = df_test1[[x for x in df_test1.columns if x not in cols] + cols]
print (df_test1)
region1 region2 region3 code year week
0 0 1 0 111.0002.0056 2017 28
1 1 1 0 111.0002.0056 2017 29
2 0 1 0 111.0002.0056 2017 30
3 0 0 1 111.0002.0114 2017 31
4 0 2 0 112.5600.6325 2017 30
5 0 0 1 112.5600.8159 2017 28
6 0 1 0 112.5600.8159 2017 30
7 0 0 1 112.5600.8159 2017 31
8 0 1 0 112.6500.2285 2017 31
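For anyone who wants to run the approach above end to end, here is a minimal sketch. It assumes the eleven sample rows shown in the question are the whole input; the names `wide`, `keys` and `alt` are illustrative only. The `pivot_table` call at the end is an alternative not used in the answer, shown because it should produce the same table in a single call.

```python
import pandas as pd

# Sample rows copied from the excerpt in the question (assumed to be the full input)
df_test = pd.DataFrame({
    'code': ['111.0002.0056', '112.6500.2285', '112.5600.6325', '112.5600.6325',
             '112.5600.8159', '111.0002.0056', '111.0002.0056', '111.0002.0056',
             '112.5600.8159', '112.5600.8159', '111.0002.0114'],
    'place': ['region1', 'region2', 'region2', 'region2', 'region2', 'region2',
              'region2', 'region2', 'region3', 'region3', 'region3'],
    'vl': [1] * 11,
    'year': [2017] * 11,
    'week': [29, 31, 30, 30, 30, 29, 30, 28, 31, 28, 31],
})

# Sum vl per (code, year, week, place), spread place into columns,
# then move the index levels back into ordinary columns
wide = (
    df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum()
    .unstack(fill_value=0)
    .reset_index()
    .rename_axis(None, axis=1)  # drop the leftover "place" label on the columns
)

# Optional: put the key columns last, as in the desired output
keys = ['code', 'year', 'week']
wide = wide[[c for c in wide.columns if c not in keys] + keys]

# Alternative (not from the answer): the same reshaping via pivot_table
alt = (
    df_test.pivot_table(index=keys, columns='place', values='vl',
                        aggfunc='sum', fill_value=0)
    .reset_index()
    .rename_axis(None, axis=1)
)
print(wide)
```

Chaining `reset_index` directly after `unstack` avoids ever storing the index tuples in a column, so nothing has to be split apart afterwards.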
Answer 1 (score: 2)
Alternatively, you can try pd.crosstab:
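df=df.set_index(['code', 'year', 'week','vl'])
df=pd.crosstab(df.index,df.place).reset_index()
# split the index tuples in row_0 into separate columns
df[['code', 'year', 'week','vl']]=df['row_0'].apply(pd.Series)
# drop row_0 in its own step; the expanded frame above has no row_0 column to drop
df=df.drop('row_0',axis=1)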
Out[32]:
place region1 region2 region3 code year week vl
0 0 1 0 111.0002.0056 2017 28 1
1 1 1 0 111.0002.0056 2017 29 1
2 0 1 0 111.0002.0056 2017 30 1
3 0 0 1 111.0002.0114 2017 31 1
4 0 2 0 112.5600.6325 2017 30 1
5 0 0 1 112.5600.8159 2017 28 1
6 0 1 0 112.5600.8159 2017 30 1
7 0 0 1 112.5600.8159 2017 31 1
8 0 1 0 112.6500.2285 2017 31 1
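Note that the crosstab above counts rows per place; this equals the sum of vl only because every vl in the sample is 1. The sketch below is a variation, not the answer's code: it passes the key columns to `pd.crosstab` as a list, so no tuples have to be split afterwards, and uses the `values` and `aggfunc` parameters to sum vl. It assumes the sample rows from the question; the name `out` is illustrative.

```python
import pandas as pd

# Sample rows copied from the excerpt in the question (assumed to be the full input)
df = pd.DataFrame({
    'code': ['111.0002.0056', '112.6500.2285', '112.5600.6325', '112.5600.6325',
             '112.5600.8159', '111.0002.0056', '111.0002.0056', '111.0002.0056',
             '112.5600.8159', '112.5600.8159', '111.0002.0114'],
    'place': ['region1', 'region2', 'region2', 'region2', 'region2', 'region2',
              'region2', 'region2', 'region3', 'region3', 'region3'],
    'vl': [1] * 11,
    'year': [2017] * 11,
    'week': [29, 31, 30, 30, 30, 29, 30, 28, 31, 28, 31],
})

out = (
    pd.crosstab([df['code'], df['year'], df['week']], df['place'],
                values=df['vl'], aggfunc='sum')  # sum vl instead of counting rows
    .fillna(0)          # combinations that never occur come back as NaN
    .astype(int)
    .reset_index()      # code, year, week become ordinary columns
    .rename_axis(None, axis=1)
)
print(out)
```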
Answer 2 (score: 1)
You can skip creating index1 altogether and use the get_level_values(<index>) method of df_test1.index instead. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.get_level_values.html#pandas.MultiIndex.get_level_values. The calls should look like this:
df_test1['code'] = df_test1.index.get_level_values(0)
df_test1['year'] = df_test1.index.get_level_values(1)
df_test1['week'] = df_test1.index.get_level_values(2)
This should work regardless of how the MultiIndex was produced (whether by groupby(), pivot_table() or some other way).
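A minimal sketch of the same idea, assuming the sample rows from the question: `get_level_values` also accepts a level name, so the three assignments can be written as a loop over `index.names`. The final `reset_index(drop=True)` step is an addition here, used to discard the MultiIndex once its levels have been copied; `result` is an illustrative name.

```python
import pandas as pd

# Sample rows copied from the excerpt in the question (assumed to be the full input)
df_test = pd.DataFrame({
    'code': ['111.0002.0056', '112.6500.2285', '112.5600.6325', '112.5600.6325',
             '112.5600.8159', '111.0002.0056', '111.0002.0056', '111.0002.0056',
             '112.5600.8159', '112.5600.8159', '111.0002.0114'],
    'place': ['region1', 'region2', 'region2', 'region2', 'region2', 'region2',
              'region2', 'region2', 'region3', 'region3', 'region3'],
    'vl': [1] * 11,
    'year': [2017] * 11,
    'week': [29, 31, 30, 30, 30, 29, 30, 28, 31, 28, 31],
})

df_test1 = (
    df_test.groupby(['code', 'year', 'week', 'place'])['vl'].sum()
    .unstack(fill_value=0)
)

# Copy each index level into a column of the same name
for level in df_test1.index.names:
    df_test1[level] = df_test1.index.get_level_values(level)

# Drop the MultiIndex itself and the leftover "place" label on the columns
result = df_test1.reset_index(drop=True).rename_axis(None, axis=1)
print(result)
```

Because the new columns are appended on the right, this gives the layout asked for in the question (region columns first, then code, year and week) without a separate reordering step.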