Pandas groupby and pandasql GROUP BY give different results for the same dataset

Time: 2018-08-17 22:03:03

Tags: python pandas pandas-groupby

I ran a groupby with both pandas and pandasql on the dataset aadhaar_data and got different totals for the number of people over 50 with aadhaar_generated.

Pandas approach

import pandas as pd 
df = pd.read_csv('data/aadhaar_data.csv')
# df = pd.read_csv('https://github.com/leonvanbokhorst/Udacity-DS101-IntroToDataScience/blob/master/Data/aadhaar_data.csv?raw=true')
df = df.rename(columns = lambda x: x.replace(' ', '_').lower())

mask = (df.age > 50) & (df.aadhaar_generated == 1)
df = df[mask]


df = df.groupby(["gender", "district"]).count()
df.reset_index()  # returns a new frame; the result is not assigned, so df keeps its MultiIndex
print(df.head())
print(df.tail())

Pandas approach result

                   age  aadhaar_generated  enrolment_rejected  \
gender district                                                 
F      Ahmadnagar   30                 30                  30   
       Ahmedabad     1                  1                   1   
       Ajmer        21                 21                  21   
       Akola         5                  5                   5   
       Alirajpur    40                 40                  40   

pandasql approach

import pandas as pd
import pandasql as pdsql

df = pd.read_csv('https://github.com/leonvanbokhorst/Udacity-DS101-IntroToDataScience/blob/master/Data/aadhaar_data.csv?raw=true')
df = df.rename(columns = lambda x: x.replace(' ', '_').lower())

q = """
SELECT gender, district, sum(aadhaar_generated)
FROM df
WHERE age > 50
GROUP BY gender, district;
"""

# Execute your SQL command against the pandas frame
df = pdsql.sqldf(q.lower(), locals())
print(df.head())
print(df.tail())

pandasql result

  gender     district  sum(aadhaar_generated)
0      F   Ahmadnagar                      45
1      F  Ahmed Nagar                       0
2      F    Ahmedabad                       1
3      F        Ajmer                      27
4      F        Akola                       5
...
523      M  West Singhbhum                     227
524      M    West Tripura                      24
525      M          Yadgir                      12
526      M    Yamuna Nagar                     149
527      M        Yavatmal                      54

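**Question**
How can I get the same result from the pandas approach as from pandasql? For example, for the district Ahmadnagar (gender F), the pandas approach reports 30 people with a generated Aadhaar card, while the pandasql approach reports 45. How is that possible?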
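One likely explanation, not confirmed against the real file, is that the two snippets compute different things. The pandas code keeps only rows where aadhaar_generated equals exactly 1 and then counts rows, while the SQL query sums aadhaar_generated over every row with age > 50, including rows whose value is greater than 1. A minimal sketch on hypothetical toy data (the values below are made up and are not taken from aadhaar_data.csv):

```python
import pandas as pd

# Hypothetical toy data, NOT taken from aadhaar_data.csv
toy = pd.DataFrame({
    "gender": ["F", "F", "F", "F"],
    "district": ["Ahmadnagar"] * 4,
    "age": [55, 60, 65, 70],
    "aadhaar_generated": [1, 1, 3, 0],
})

over_50 = toy[toy.age > 50]

# What the pandas snippet does: keep rows where the value is exactly 1,
# then count the remaining rows per group
row_count = (
    over_50[over_50.aadhaar_generated == 1]
    .groupby(["gender", "district"])["aadhaar_generated"]
    .count()
)

# What the SQL query does: sum the column over all rows with age > 50
total = over_50.groupby(["gender", "district"])["aadhaar_generated"].sum()

print(row_count)
print(total)
```

In this toy frame the row count and the sum differ because one row carries a value of 3. If the real column holds counts per row rather than 0/1 flags, the same effect would account for 30 versus 45.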

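Also, the pandasql result has only three columns with three headers, whereas the pandas result has a MultiIndex and all columns. How can I convert it to look like the pandasql result?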
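For the shape of the output, one possible sketch is to select the single column to aggregate and pass as_index=False to groupby, so that the grouping keys stay as ordinary columns. The toy frame below is hypothetical and only illustrates the layout:

```python
import pandas as pd

# Hypothetical toy data, NOT taken from aadhaar_data.csv
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "district": ["Ajmer", "Ajmer", "Akola", "Akola"],
    "age": [51, 40, 70, 80],
    "aadhaar_generated": [2, 5, 1, 4],
})

# Filter on age only (as the SQL WHERE clause does), group by the two keys,
# and sum a single column; as_index=False keeps the keys as regular columns
result = (
    toy[toy.age > 50]
    .groupby(["gender", "district"], as_index=False)["aadhaar_generated"]
    .sum()
)

print(result)
```

Calling reset_index() on a grouped result gives a similar flat layout, provided the return value is assigned back (for example df = df.reset_index()), since the method returns a new frame rather than modifying df in place.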

0 answers:

No answers