A simplified version of my data looks like this:
mon site year data1 data2
1 57598 2001 58 1383
2 57598 2001 75 549
1 57598 2002 118 1337
2 57598 2002 162 2213
1 50136 2000 -282 134
2 50136 2000 -242 0
1 50136 2001 -126 102
1 50844 2000 152 411
2 50844 2000 70 117
1 50844 2002 -74 44
2 50844 2002 -173 83
I want to extract data1 and data2 and reshape each of them into the following format.
This is data1:
        2000   2000   2001   2001   2002   2002
           1      2      1      2      1      2
50136   -282   -242   -126     NA     NA     NA
50844    152     70     NA     NA    -74   -173
57598     NA     NA     58     75    118    162
data2 should be saved to a new file in the same layout as data1.
I would like to do this with pandas.groupby, but the following code raises an error:
df['data1'].groupby(df['year'],df['mon'],df['site'])
Is this easy to do with groupby?
Answer 0 (score: 2)
df1 = df.set_index(['year','mon','site'])['data1'].unstack(level=[0,1]).sort_index(axis=1)
print (df1)
year     2000            2001         2002
mon         1       2       1     2      1       2
site
50136  -282.0  -242.0  -126.0   NaN    NaN     NaN
50844   152.0    70.0     NaN   NaN  -74.0  -173.0
57598     NaN     NaN    58.0  75.0  118.0   162.0
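For anyone who wants to reproduce the output above, here is a minimal self-contained sketch. It assumes the sample rows from the question have been typed in by hand as a DataFrame named df; how you actually load your data (read_csv, read_table, and so on) is up to you and is not shown here.

```python
import pandas as pd

# Sample rows copied from the question; building df by hand is an assumption,
# in practice it would come from whatever file holds the data.
rows = [
    (1, 57598, 2001, 58, 1383),
    (2, 57598, 2001, 75, 549),
    (1, 57598, 2002, 118, 1337),
    (2, 57598, 2002, 162, 2213),
    (1, 50136, 2000, -282, 134),
    (2, 50136, 2000, -242, 0),
    (1, 50136, 2001, -126, 102),
    (1, 50844, 2000, 152, 411),
    (2, 50844, 2000, 70, 117),
    (1, 50844, 2002, -74, 44),
    (2, 50844, 2002, -173, 83),
]
df = pd.DataFrame(rows, columns=['mon', 'site', 'year', 'data1', 'data2'])

# Move year and mon into the column header and keep site as the row index.
df1 = (df.set_index(['year', 'mon', 'site'])['data1']
         .unstack(level=[0, 1])
         .sort_index(axis=1))
print(df1)
```

The values come back as floats because the missing (site, year, mon) combinations are filled with NaN.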
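But if you get:
ValueError: Index contains duplicate entries, cannot reshape
then use another solution based on groupby or pivot_table, which aggregates the duplicates first: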
df1 = df.groupby(['year','mon','site'])['data1'].mean().unstack(level=[0,1])
print (df1)
year     2000            2001         2002
mon         1       2       1     2      1       2
site
50136  -282.0  -242.0  -126.0   NaN    NaN     NaN
50844   152.0    70.0     NaN   NaN  -74.0  -173.0
57598     NaN     NaN    58.0  75.0  118.0   162.0
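To illustrate when the ValueError shows up and why averaging first avoids it, here is a small sketch. The DataFrame dup is made up for illustration (the question's sample data has no duplicates): it contains two rows for the same (year, mon, site) combination.

```python
import pandas as pd

# Hypothetical data: the (2000, 1, 50136) combination appears twice.
dup = pd.DataFrame({
    'mon':   [1, 1, 2],
    'site':  [50136, 50136, 50136],
    'year':  [2000, 2000, 2000],
    'data1': [-282, -280, -242],
})

# set_index + unstack needs a unique index, so this attempt fails.
try:
    dup.set_index(['year', 'mon', 'site'])['data1'].unstack(level=[0, 1])
    reshaped_without_error = True
except ValueError as exc:
    reshaped_without_error = False
    print('unstack failed:', exc)

# Averaging first leaves exactly one value per (year, mon, site) key.
fixed = dup.groupby(['year', 'mon', 'site'])['data1'].mean().unstack(level=[0, 1])
print(fixed)
```

Whether the mean is the right way to collapse duplicates depends on your data; sum, first or max may fit better.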
Another possible solution is pivot_table. Its default aggfunc is the mean (np.mean), but you can change it to another function, for example aggfunc='sum':
print (df.pivot_table(index='site', columns=['year','mon'], values='data1', aggfunc='mean'))
year     2000            2001         2002
mon         1       2       1     2      1       2
site
50136  -282.0  -242.0  -126.0   NaN    NaN     NaN
50844   152.0    70.0     NaN   NaN  -74.0  -173.0
57598     NaN     NaN    58.0  75.0  118.0   162.0
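The choice of aggfunc only matters when several rows share the same site, year and mon. A short sketch of the difference, using a made-up DataFrame named dup (not the question's data) in which one combination occurs twice:

```python
import pandas as pd

# Hypothetical rows: site 50136 has two readings for mon 1 of year 2000.
dup = pd.DataFrame({
    'mon':   [1, 1, 2],
    'site':  [50136, 50136, 50136],
    'year':  [2000, 2000, 2000],
    'data1': [-282, -280, -242],
})

# Same call as above, once averaging and once summing the duplicates.
by_mean = dup.pivot_table(index='site', columns=['year', 'mon'],
                          values='data1', aggfunc='mean')
by_sum = dup.pivot_table(index='site', columns=['year', 'mon'],
                         values='data1', aggfunc='sum')
print(by_mean)
print(by_sum)
```

Cells backed by a single row are the same in both results; only the duplicated cell differs.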
Finally, write the result to a CSV file with DataFrame.to_csv.
df1.to_csv('file_out.csv')
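Because df1 has two header levels (year and mon), the CSV gets two header lines plus a line for the index name. The sketch below shows roughly what gets written and one way to read it back; the one-row frame is a stand-in for the real df1, and the text is kept in memory instead of in file_out.csv so the example runs anywhere.

```python
import io
import pandas as pd

# Small stand-in for df1: one site row under two-level (year, mon) columns.
columns = pd.MultiIndex.from_tuples([(2000, 1), (2000, 2)], names=['year', 'mon'])
df1 = pd.DataFrame([[-282.0, -242.0]],
                   index=pd.Index([50136], name='site'),
                   columns=columns)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = df1.to_csv()
print(csv_text)

# header=[0, 1] rebuilds the two column levels; index_col=0 restores site.
restored = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
print(restored)
```

If you would rather have a single flat header line, you could join the two levels into one string per column before calling to_csv.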
Answer 1 (score: 0)
Reshape df into the form you need:
result = df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack()
Out[310]:
year     2000            2001         2002
mon         1       2       1     2      1       2
site
50136  -282.0  -242.0  -126.0   NaN    NaN     NaN
50844   152.0    70.0     NaN   NaN  -74.0  -173.0
57598     NaN     NaN    58.0  75.0  118.0   162.0
Save it to CSV:
df.groupby(['site','mon','year'])['data1'].mean().unstack().unstack().to_csv('data1.csv')
df.groupby(['site','mon','year'])['data2'].mean().unstack().unstack().to_csv('data2.csv')
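Since the two lines differ only in the column name, they could be folded into a loop. This is just a sketch: the helper name to_wide, the shortened sample df and the temporary output directory are all assumptions made so the example runs on its own.

```python
import os
import tempfile
import pandas as pd

# Shortened sample: three rows from the question's data.
df = pd.DataFrame(
    [(1, 57598, 2001, 58, 1383),
     (2, 57598, 2001, 75, 549),
     (1, 50136, 2000, -282, 134)],
    columns=['mon', 'site', 'year', 'data1', 'data2'])

def to_wide(frame, column):
    """Hypothetical helper: one row per site, (year, mon) column pairs."""
    grouped = frame.groupby(['site', 'mon', 'year'])[column].mean()
    # First unstack moves year into the columns, the second moves mon.
    return grouped.unstack().unstack()

out_dir = tempfile.mkdtemp()  # stand-in for wherever the files should go
written = []
for column in ['data1', 'data2']:
    path = os.path.join(out_dir, column + '.csv')
    to_wide(df, column).to_csv(path)
    written.append(path)
print(written)
```

Note that the grouping keys are passed to groupby as one list. The attempt in the question passed three Series as separate positional arguments, and groupby only treats the first argument as keys.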