Python: how do I convert a complex SQL aggregation statement to pandas?

Asked: 2014-12-02 22:11:36

Tags: python pandas dataframe

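Using pandas, what is the best way to reproduce a SQL GROUP BY statement that has:

  • a different aggregate function for each field (e.g. I need the sum of field1, the average of field2, and the max of field3)
  • slightly more complex calculations such as sum(field1) / sum(field2), e.g. a weighted average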

Let's say I have a table with city-level data and I want to aggregate it by country and region. In SQL I would write:

select Country, Region
, count(*) as '# of cities'
,sum(GDP) as GDP
,avg(Population) as 'avg # inhabitants per city'
,sum(male_population) / sum(Population) as '% of male population'
from CityTable
group by Country, Region

How can I do the same thing in pandas? Thanks!

2 answers:

Answer 0 (score: 1)

>>> import numpy as np
>>> df
   Country  Region  GDP  Population  male_population
0      USA      TX   10         100               50
1      USA      TX   11         120               60
2      USA      KY   11         200              120
3  Austria  Wienna    5          50               34
>>>
>>> df2 = df.groupby(['Country','Region']).agg({'GDP': [np.size, np.sum], 'Population': [np.average, np.sum], 'male_population': np.sum})
>>> df2
                GDP     male_population Population     
               size sum             sum    average  sum
Country Region                                         
Austria Wienna    1   5              34         50   50
USA     KY        1  11             120        200  200
        TX        2  21             110        110  220
>>>
>>> df2['% of male population'] = df2['male_population','sum'].divide(df2['Population','sum'])
>>> df2
                GDP     male_population Population      % of male population
               size sum             sum    average  sum                     
Country Region                                                              
Austria Wienna    1   5              34         50   50                 0.68
USA     KY        1  11             120        200  200                 0.60
        TX        2  21             110        110  220                 0.50
>>>
>>> del df2['male_population', 'sum']
>>> del df2['Population', 'sum']
>>> df2.columns = ['# of cities', 'GDP', 'avg # inhabitants per city', '% of male population']

Result:

>>> df2
                # of cities  GDP  avg # inhabitants per city  % of male population
Country Region                                                                    
Austria Wienna            1    5                          50                  0.68
USA     KY                1   11                         200                  0.60
        TX                2   21                         110                  0.50

Answer 1 (score: 1)

Here is another option:

In [1]: import pandas as pd    

In [2]: df = pd.read_csv('countries.csv')

In [3]: df
Out[3]: 
  Country Region  GDP  Population  male_population
0      US     TX   10         100               50
1      US     TX   11         120               60
2      US     KY   11         200              120
3      AU     WN    5          50               34
4      AU     SY    8         100               60

In [4]: df.groupby(['Country', 'Region']).apply(lambda gb: pd.Series({
   ...:     '# of cities': len(gb),
   ...:     'GDP': gb['GDP'].sum(),
   ...:     'avg': gb['Population'].mean(),
   ...:     '% of male population': float(gb['male_population'].sum()) / gb['Population'].sum(),
   ...: }))
Out[4]: 
                # of cities  % of male population  GDP  avg
Country Region                                             
AU      SY                1                  0.60    8  100
        WN                1                  0.68    5   50
US      KY                1                  0.60   11  200
        TX                2                  0.50   21  110

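One drawback of this approach is that you cannot reference a computed value elsewhere in the query, i.e. you cannot reuse the population sum for both the average and the % male result.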
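One possible way around that, sketched here under assumptions rather than taken from the original answer, is to replace the lambda with a named function so the population sum is computed once per group and reused. The function name `summarize` and the inline data (standing in for `countries.csv`) are hypothetical.

```python
import pandas as pd

# Inline stand-in for countries.csv, using the rows shown above.
df = pd.DataFrame({
    "Country": ["US", "US", "US", "AU", "AU"],
    "Region": ["TX", "TX", "KY", "WN", "SY"],
    "GDP": [10, 11, 11, 5, 8],
    "Population": [100, 120, 200, 50, 100],
    "male_population": [50, 60, 120, 34, 60],
})


def summarize(group):
    """Aggregate one (Country, Region) group, reusing the population sum."""
    n_cities = len(group)
    total_population = group["Population"].sum()  # computed once, used twice
    return pd.Series({
        "# of cities": n_cities,
        "GDP": group["GDP"].sum(),
        "avg # inhabitants per city": total_population / n_cities,
        "% of male population": group["male_population"].sum() / total_population,
    })


# Selecting the value columns explicitly keeps the grouping keys out of
# the frames passed to summarize.
value_columns = ["GDP", "Population", "male_population"]
summary = df.groupby(["Country", "Region"])[value_columns].apply(summarize)
print(summary)
```

Because each group's result is built as a single Series that mixes counts and ratios, all output columns come back as floats. Calling a Python function once per group is also likely to be slower than built-in aggregations on large tables.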