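Using pandas, what is the best way to do the equivalent of a SQL GROUP BY statement?
Suppose I have a table with city-level data that I want to aggregate by country and region. In SQL I would write: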
select Country, Region
, count(*) as '# of cities'
,sum(GDP) as GDP
,avg(Population) as 'avg # inhabitants per city'
,sum(male_population) / sum(Population) as '% of male population'
from CityTable
group by Country, Region
How can I do the same thing in pandas? Thanks!
Answer 0 (score: 1)
>>> import numpy as np
>>> df
Country Region GDP Population male_population
0 USA TX 10 100 50
1 USA TX 11 120 60
2 USA KY 11 200 120
3 Austria Wienna 5 50 34
>>>
>>> df2 = df.groupby(['Country','Region']).agg({'GDP': [np.size, np.sum], 'Population': [np.average, np.sum], 'male_population': np.sum})
>>> df2
GDP male_population Population
size sum sum average sum
Country Region
Austria Wienna 1 5 34 50 50
USA KY 1 11 120 200 200
TX 2 21 110 110 220
>>>
>>> df2['% of male population'] = df2['male_population','sum'].divide(df2['Population','sum'])
>>> df2
GDP male_population Population % of male population
size sum sum average sum
Country Region
Austria Wienna 1 5 34 50 50 0.68
USA KY 1 11 120 200 200 0.60
TX 2 21 110 110 220 0.50
>>>
>>> del df2['male_population', 'sum']
>>> del df2['Population', 'sum']
>>> df2.columns = ['# of cities', 'GDP', 'avg # inhabitants per city', '% of male population']
Result:
>>> df2
# of cities GDP avg # inhabitants per city % of male population
Country Region
Austria Wienna 1 5 50 0.68
USA KY 1 11 200 0.60
TX 2 21 110 0.50
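For comparison, here is a minimal sketch of the same aggregation using named aggregation, assuming pandas 0.25 or later. The intermediate column names `population_sum` and `male_sum` are made up for illustration, and the DataFrame is rebuilt from the sample data shown above.

```python
import pandas as pd

# Sample data copied from the session above.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Austria"],
    "Region": ["TX", "TX", "KY", "Wienna"],
    "GDP": [10, 11, 11, 5],
    "Population": [100, 120, 200, 50],
    "male_population": [50, 60, 120, 34],
})

# Each keyword is an output column: (input column, aggregation function).
summary = df.groupby(["Country", "Region"]).agg(
    n_cities=("GDP", "size"),               # like count(*)
    GDP=("GDP", "sum"),
    avg_population=("Population", "mean"),
    population_sum=("Population", "sum"),   # helper column (hypothetical name)
    male_sum=("male_population", "sum"),    # helper column (hypothetical name)
)

# Derive the ratio from the two helper sums, then drop the helpers.
summary["pct_male"] = summary["male_sum"] / summary["population_sum"]
summary = summary.drop(columns=["population_sum", "male_sum"])

# Rename to the aliases used in the SQL query.
summary = summary.rename(columns={
    "n_cities": "# of cities",
    "avg_population": "avg # inhabitants per city",
    "pct_male": "% of male population",
})
print(summary)
```

This produces flat column names directly, so the MultiIndex column handling and the `del` statements are not needed.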
Answer 1 (score: 1)
Here is another option:
In [1]: import pandas as pd
In [2]: df = pd.read_csv('countries.csv')
In [3]: df
Out[3]:
Country Region GDP Population male_population
0 US TX 10 100 50
1 US TX 11 120 60
2 US KY 11 200 120
3 AU WN 5 50 34
4 AU SY 8 100 60
In [4]: df.groupby(['Country', 'Region']).apply(lambda gb: pd.Series({
...: '# of cities': len(gb),
...: 'GDP': gb['GDP'].sum(),
...: 'avg': gb['Population'].mean(),
...: '% of male population': float(gb['male_population'].sum()) / gb['Population'].sum(),
...: }))
Out[4]:
# of cities % of male population GDP avg
Country Region
AU SY 1 0.60 8 100
WN 1 0.68 5 50
US KY 1 0.60 11 200
TX 2 0.50 21 110
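One drawback of this approach is that you cannot refer to a computed value elsewhere in the query, for example reusing the population sum for both the average and the % male result.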
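One possible way around that limitation, sketched here under the assumption that a named function is acceptable in place of the lambda, is to store the shared value in a local variable. The function name `summarize` is hypothetical, and the DataFrame is rebuilt from the sample data instead of being read from `countries.csv`.

```python
import pandas as pd

# Sample data copied from the session above.
df = pd.DataFrame({
    "Country": ["US", "US", "US", "AU", "AU"],
    "Region": ["TX", "TX", "KY", "WN", "SY"],
    "GDP": [10, 11, 11, 5, 8],
    "Population": [100, 120, 200, 50, 100],
    "male_population": [50, 60, 120, 34, 60],
})

def summarize(group):
    """Aggregate one (Country, Region) group, computing the population sum once."""
    pop_sum = group["Population"].sum()
    return pd.Series({
        "# of cities": len(group),
        "GDP": group["GDP"].sum(),
        # Equivalent to .mean() only when Population has no missing values.
        "avg": pop_sum / len(group),
        "% of male population": group["male_population"].sum() / pop_sum,
    })

# Selecting the value columns first keeps the grouping columns out of each group.
value_columns = ["GDP", "Population", "male_population"]
result = df.groupby(["Country", "Region"])[value_columns].apply(summarize)
print(result)
```

Because each group is processed by a Python function, this is likely to be slower than built-in aggregations on large tables.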