I ran a group-by on the aadhaar_data dataset with both pandas and pandasql, and got different values for the total aadhaar_generated count of people over 50.
pandas approach
import pandas as pd
df = pd.read_csv('data/aadhaar_data.csv')
# df = pd.read_csv('https://github.com/leonvanbokhorst/Udacity-DS101-IntroToDataScience/blob/master/Data/aadhaar_data.csv?raw=true')
df = df.rename(columns = lambda x: x.replace(' ', '_').lower())
mask = (df.age > 50) & (df.aadhaar_generated == 1)
df = df[mask]
df = df.groupby(["gender", "district"]).count()
df.reset_index()
print(df.head())
print(df.tail())
pandas approach result
                    age  aadhaar_generated  enrolment_rejected  \
gender district
F      Ahmadnagar    30                 30                  30
       Ahmedabad      1                  1                   1
       Ajmer         21                 21                  21
       Akola          5                  5                   5
       Alirajpur     40                 40                  40
pandasql approach
import pandas as pd
import pandasql as pdsql
df = pd.read_csv('https://github.com/leonvanbokhorst/Udacity-DS101-IntroToDataScience/blob/master/Data/aadhaar_data.csv?raw=true')
df = df.rename(columns = lambda x: x.replace(' ', '_').lower())
q = """
SELECT gender, district, sum(aadhaar_generated)
FROM df
WHERE age > 50
GROUP BY gender, district;
"""
# Execute your SQL command against the pandas frame
df = pdsql.sqldf(q.lower(), locals())
print(df.head())
print(df.tail())
pandasql result
     gender  district        sum(aadhaar_generated)
0    F       Ahmadnagar      45
1    F       Ahmed Nagar     0
2    F       Ahmedabad       1
3    F       Ajmer           27
4    F       Akola           5
...
523  M       West Singhbhum  227
524  M       West Tripura    24
525  M       Yadgir          12
526  M       Yamuna Nagar    149
527  M       Yavatmal        54
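**Question**
How can I get the same result from the pandas approach as from pandasql?
For example, for Ahmadnagar the pandas approach reports 30 people with a generated aadhaar card, while the pandasql approach reports 45. How is that possible?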
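One possible explanation, sketched on made-up rows rather than the real file: the pandas code keeps only rows where aadhaar_generated == 1 and then counts rows, while the SQL query keeps every row with age > 50 and sums the values. This assumes aadhaar_generated holds per-row totals that can exceed 1, which the 30 versus 45 gap suggests but the post does not confirm. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows, not the real aadhaar data: one district, everyone over 50.
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F"],
    "district": ["Ahmadnagar"] * 4,
    "age": [55, 60, 65, 70],
    "aadhaar_generated": [1, 1, 3, 0],
})

over_50 = toy[toy.age > 50]
keys = ["gender", "district"]

# Like the pandas code in the post: keep rows equal to 1, then count rows.
only_ones = over_50[over_50.aadhaar_generated == 1]
counted = only_ones.groupby(keys)["aadhaar_generated"].count()

# Like the SQL query: keep every row over 50, then add up the values.
summed = over_50.groupby(keys)["aadhaar_generated"].sum()

print(counted.iloc[0])  # 2: only the two rows equal to 1 are counted
print(summed.iloc[0])   # 5: 1 + 1 + 3 + 0
```

The same difference would also explain why a group such as Ahmed Nagar appears with a sum of 0 in the pandasql output: the SQL WHERE clause does not drop rows with zero generated cards, whereas a mask on aadhaar_generated == 1 removes them before grouping.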
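Also, the pandasql result has only three columns with three headers.
The pandas result, however, has a MultiIndex and all of the columns. How can I convert it to look like the pandasql result?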
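A minimal sketch of one way to get a flat, three-column frame from pandas, again on hypothetical data (the `toy` frame is invented for illustration): select the single column to aggregate and pass `as_index=False` so the group keys stay as ordinary columns.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "district": ["Ajmer", "Ajmer", "Ajmer", "Akola"],
    "age": [55, 40, 62, 71],
    "aadhaar_generated": [2, 5, 1, 0],
})

flat = (
    toy[toy.age > 50]                                   # same filter as WHERE age > 50
    .groupby(["gender", "district"], as_index=False)    # keep keys as columns, no MultiIndex
    ["aadhaar_generated"]                               # aggregate only this column
    .sum()                                              # same aggregate as SUM(...)
)
print(flat)
```

Note that `df.reset_index()` in the pandas code above returns a new frame and does not modify `df`, so its result has to be assigned (for example `df = df.reset_index()`) before the MultiIndex goes away.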