我正在使用Pandas 0.19。
考虑以下数据框:
FID admin0 admin1 admin2 windspeed population
0 cntry1 state1 city1 60km/h 700
1 cntry1 state1 city1 90km/h 210
2 cntry1 state1 city2 60km/h 100
3 cntry1 state2 city3 60km/h 70
4 cntry1 state2 city4 60km/h 180
5 cntry1 state2 city4 90km/h 370
6 cntry2 state3 city5 60km/h 890
7 cntry2 state3 city6 60km/h 120
8 cntry2 state3 city6 90km/h 420
9 cntry2 state3 city6 120km/h 360
10 cntry2 state4 city7 60km/h 740
如何创建这样的表?
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
cntry1 state1 city2 100 0 0
cntry1 state2 city3 70 0 0
cntry1 state2 city4 180 370 0
cntry2 state3 city5 890 0 0
cntry2 state3 city6 120 420 360
cntry2 state4 city7 740 0 0
我尝试使用以下数据透视表:
table = pd.pivot_table(df,index=["admin0","admin1","admin2"], columns=["windspeed"], values=["population"],fill_value=0)
一般情况下它运行良好,但不幸的是我无法按正确的顺序对新列进行排序:120km / h列出现在60km / h和90km / h之前。如何指定新列的顺序?
此外,作为第二步,我需要为admin0和admin1添加小计。理想情况下,我需要的表应该是这样的:
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
cntry1 state1 city2 100 0 0
SUM state1 800 210 0
cntry1 state2 city3 70 0 0
cntry1 state2 city4 180 370 0
SUM state2 250 370 0
SUM cntry1 1050 580 0
cntry2 state3 city5 890 0 0
cntry2 state3 city6 120 420 360
SUM state3 1010 420 360
cntry2 state4 city7 740 0 0
SUM state4 740 0 0
SUM cntry2 1750 420 360
SUM ALL 2800 1000 360
答案 0 :(得分:5)
您可以使用reindex()方法和自定义排序:
In [26]: table
Out[26]:
population
windspeed 120km/h 60km/h 90km/h
admin0 admin1 admin2
cntry1 state1 city1 0 700 210
city2 0 100 0
state2 city3 0 70 0
city4 0 180 370
cntry2 state3 city5 0 890 0
city6 360 120 420
state4 city7 0 740 0
In [27]: cols = sorted(table.columns.tolist(), key=lambda x: int(x[1].replace('km/h','')))
In [28]: cols
Out[28]: [('population', '60km/h'), ('population', '90km/h'), ('population', '120km/h')]
In [29]: table = table.reindex(columns=cols)
In [30]: table
Out[30]:
population
windspeed 60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
city2 100 0 0
state2 city3 70 0 0
city4 180 370 0
cntry2 state3 city5 890 0 0
city6 120 420 360
state4 city7 740 0 0
答案 1 :(得分:4)
小计和MultiIndex.from_arrays
的解决方案。上次concat
以及所有Dataframes
,sort_index
并添加所有sum
:
#replace km/h and convert to int
df.windspeed = df.windspeed.str.replace('km/h','').astype(int)
print (df)
FID admin0 admin1 admin2 windspeed population
0 0 cntry1 state1 city1 60 700
1 1 cntry1 state1 city1 90 210
2 2 cntry1 state1 city2 60 100
3 3 cntry1 state2 city3 60 70
4 4 cntry1 state2 city4 60 180
5 5 cntry1 state2 city4 90 370
6 6 cntry2 state3 city5 60 890
7 7 cntry2 state3 city6 60 120
8 8 cntry2 state3 city6 90 420
9 9 cntry2 state3 city6 120 360
10 10 cntry2 state4 city7 60 740
#pivoting
table = pd.pivot_table(df,
index=["admin0","admin1","admin2"],
columns=["windspeed"],
values=["population"],
fill_value=0)
print (table)
population
windspeed 60 90 120
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
city2 100 0 0
state2 city3 70 0 0
city4 180 370 0
cntry2 state3 city5 890 0 0
city6 120 420 360
state4 city7 740 0 0
#groupby and create sum dataframe by levels 0,1
df1 = table.groupby(level=[0,1]).sum()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
df1.index.get_level_values(1)+ '_sum',
len(df1.index) * ['']])
print (df1)
population
windspeed 60 90 120
admin0
cntry1 state1_sum 800 210 0
state2_sum 250 370 0
cntry2 state3_sum 1010 420 360
state4_sum 740 0 0
df2 = table.groupby(level=0).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.values + '_sum',
len(df2.index) * [''],
len(df2.index) * ['']])
print (df2)
population
windspeed 60 90 120
cntry1_sum 1050 580 0
cntry2_sum 1750 420 360
#concat all dataframes together, sort index
df = pd.concat([table, df1, df2]).sort_index(level=[0])
#add km/h to second level in columns
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
df.columns.get_level_values(1).astype(str) + 'km/h'])
#add all sum
df.loc[('All_sum','','')] = table.sum().values
print (df)
population
60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1 700 210 0
city2 100 0 0
state1_sum 800 210 0
state2 city3 70 0 0
city4 180 370 0
state2_sum 250 370 0
cntry1_sum 1050 580 0
cntry2 state3 city5 890 0 0
city6 120 420 360
state3_sum 1010 420 360
state4 city7 740 0 0
state4_sum 740 0 0
cntry2_sum 1750 420 360
All_sum 2800 1000 360
通过评论编辑:
def f(x):
print (x)
if (len(x) > 1):
return x.sum()
df1 = table.groupby(level=[0,1]).apply(f).dropna(how='all')
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
df1.index.get_level_values(1)+ '_sum',
len(df1.index) * ['']])
print (df1)
population
windspeed 60 90 120
admin0
cntry1 state1_sum 800.0 210.0 0.0
state2_sum 250.0 370.0 0.0
cntry2 state3_sum 1010.0 420.0 360.0