Question

我正在使用Pandas 0.19。

考虑以下数据框：

FID  admin0  admin1  admin2  windspeed  population
0    cntry1  state1  city1   60km/h     700
1    cntry1  state1  city1   90km/h     210
2    cntry1  state1  city2   60km/h     100
3    cntry1  state2  city3   60km/h     70
4    cntry1  state2  city4   60km/h     180
5    cntry1  state2  city4   90km/h     370
6    cntry2  state3  city5   60km/h     890
7    cntry2  state3  city6   60km/h     120
8    cntry2  state3  city6   90km/h     420
9    cntry2  state3  city6   120km/h    360
10   cntry2  state4  city7   60km/h     740

如何创建这样的表？

                                population
                         60km/h  90km/h  120km/h
admin0  admin1  admin2  
cntry1  state1  city1    700     210      0
cntry1  state1  city2    100     0        0
cntry1  state2  city3    70      0        0
cntry1  state2  city4    180     370      0
cntry2  state3  city5    890     0        0
cntry2  state3  city6    120     420      360
cntry2  state4  city7    740     0        0

我尝试使用以下数据透视表：

table = pd.pivot_table(df,index=["admin0","admin1","admin2"], columns=["windspeed"], values=["population"],fill_value=0)

一般情况下它运行良好，但不幸的是我无法按正确的顺序对新列进行排序：120km / h列出现在60km / h和90km / h之前。如何指定新列的顺序？

此外，作为第二步，我需要为admin0和admin1添加小计。理想情况下，我需要的表应该是这样的：

                                population
                         60km/h  90km/h  120km/h
admin0  admin1  admin2  
cntry1  state1  city1    700     210      0
cntry1  state1  city2    100     0        0
        SUM state1       800     210      0
cntry1  state2  city3    70      0        0
cntry1  state2  city4    180     370      0
        SUM state2       250     370      0
SUM cntry1               1050    580      0
cntry2  state3  city5    890     0        0
cntry2  state3  city6    120     420      360
        SUM state3       1010    420      360
cntry2  state4  city7    740     0        0
        SUM state4       740     0        0
SUM cntry2               1750    420      360
SUM ALL                  2800    1000    360

Answer 1

您可以使用reindex()方法和自定义排序：

In [26]: table
Out[26]:
                     population
windspeed               120km/h 60km/h 90km/h
admin0 admin1 admin2
cntry1 state1 city1           0    700    210
              city2           0    100      0
       state2 city3           0     70      0
              city4           0    180    370
cntry2 state3 city5           0    890      0
              city6         360    120    420
       state4 city7           0    740      0

In [27]: cols = sorted(table.columns.tolist(), key=lambda x: int(x[1].replace('km/h','')))

In [28]: cols
Out[28]: [('population', '60km/h'), ('population', '90km/h'), ('population', '120km/h')]

In [29]: table = table.reindex(columns=cols)

In [30]: table
Out[30]:
                     population
windspeed                60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1         700    210       0
              city2         100      0       0
       state2 city3          70      0       0
              city4         180    370       0
cntry2 state3 city5         890      0       0
              city6         120    420     360
       state4 city7         740      0       0

Answer 2

小计和MultiIndex.from_arrays的解决方案。上次concat以及所有Dataframes，sort_index并添加所有sum：

#replace km/h and convert to int
df.windspeed = df.windspeed.str.replace('km/h','').astype(int)
print (df)
    FID  admin0  admin1 admin2  windspeed  population
0     0  cntry1  state1  city1         60         700
1     1  cntry1  state1  city1         90         210
2     2  cntry1  state1  city2         60         100
3     3  cntry1  state2  city3         60          70
4     4  cntry1  state2  city4         60         180
5     5  cntry1  state2  city4         90         370
6     6  cntry2  state3  city5         60         890
7     7  cntry2  state3  city6         60         120
8     8  cntry2  state3  city6         90         420
9     9  cntry2  state3  city6        120         360
10   10  cntry2  state4  city7         60         740

#pivoting
table = pd.pivot_table(df,
                       index=["admin0","admin1","admin2"], 
                       columns=["windspeed"], 
                       values=["population"],
                       fill_value=0)
print (table)
                    population          
windspeed                   60   90   120
admin0 admin1 admin2                     
cntry1 state1 city1         700  210    0
              city2         100    0    0
       state2 city3          70    0    0
              city4         180  370    0
cntry2 state3 city5         890    0    0
              city6         120  420  360
       state4 city7         740    0    0

#groupby and create sum dataframe by levels 0,1
df1 = table.groupby(level=[0,1]).sum()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0), 
                                       df1.index.get_level_values(1)+ '_sum', 
                                       len(df1.index) * ['']])
print (df1)
                   population          
windspeed                 60   90   120
admin0                                 
cntry1 state1_sum         800  210    0
       state2_sum         250  370    0
cntry2 state3_sum        1010  420  360
       state4_sum         740    0    0

df2 = table.groupby(level=0).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.values + '_sum',
                                       len(df2.index) * [''], 
                                       len(df2.index) * ['']])
print (df2)
             population          
windspeed           60   90   120
cntry1_sum         1050  580    0
cntry2_sum         1750  420  360

#concat all dataframes together, sort index
df = pd.concat([table, df1, df2]).sort_index(level=[0])

#add km/h to second level in columns
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
                                       df.columns.get_level_values(1).astype(str) + 'km/h'])

#add all sum
df.loc[('All_sum','','')] = table.sum().values
print (df)
                             population               
                                 60km/h 90km/h 120km/h
admin0     admin1     admin2                          
cntry1     state1     city1         700    210       0
                      city2         100      0       0
           state1_sum               800    210       0
           state2     city3          70      0       0
                      city4         180    370       0
           state2_sum               250    370       0
cntry1_sum                         1050    580       0
cntry2     state3     city5         890      0       0
                      city6         120    420     360
           state3_sum              1010    420     360
           state4     city7         740      0       0
           state4_sum               740      0       0
cntry2_sum                         1750    420     360
All_sum                            2800   1000     360

通过评论编辑：

def f(x):
    print (x)
    if (len(x) > 1):
        return x.sum()

df1 = table.groupby(level=[0,1]).apply(f).dropna(how='all')
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0), 
                                       df1.index.get_level_values(1)+ '_sum', 
                                       len(df1.index) * ['']])
print (df1)
                   population              
windspeed                 60     90     120
admin0                                     
cntry1 state1_sum       800.0  210.0    0.0
       state2_sum       250.0  370.0    0.0
cntry2 state3_sum      1010.0  420.0  360.0

Pandas数据透视表：列顺序和小计

2 个答案: