Trying to understand why this code does not return the same value

Time: 2019-06-16 06:18:23

Tags: python pandas

I am new to Python and pandas, and I am trying to understand the difference between two snippets of code and why they produce different results.

I tried splitting the code into separate lines, but the two snippets still give different answers.

The question being answered: what proportion of female students are Physics majors?

Code 1:

fem_phy = (df.query("gender == 'female' & major == 'Physics'").count() /
           df.query("gender == 'female'").count())
fem_phy

Code 2:

len(df[(df['gender'] == 'female') & (df['admitted']) & 
   (df['major']=='Physics')]) / len(df[(df['gender']=='female') & 
   (df['admitted'])])

I expected the second snippet to return 0.120623, like the first one.

1 answer:

Answer 0 (score: 0)

Check with some sample data:

import pandas as pd

# sample data
df = pd.DataFrame({'gender':['female'] * 3 + ['male'] * 2,
                   'major':['Physics'] * 2 + ['Math'] * 3})

print (df)
   gender    major
0  female  Physics
1  female  Physics
2  female     Math
3    male     Math
4    male     Math

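To filter rows correctly, use DataFrame.query or boolean indexing. To get the same output from both approaches, the extra condition df['admitted'] from the second snippet is removed: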

print (df.query("gender == 'female' & major == 'Physics'"))
   gender    major
0  female  Physics
1  female  Physics

print (df.query("gender=='female'"))
   gender    major
0  female  Physics
1  female  Physics
2  female     Math

print (df[(df['gender']=='female') & (df['major']=='Physics')])
   gender    major
0  female  Physics
1  female  Physics

print (df[(df['gender']=='female')])
   gender    major
0  female  Physics
1  female  Physics
2  female     Math

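The problem is DataFrame.count. It returns the number of non-missing values in each column, not a single row count, so here you get a Series in which every value is 2 (because the data contains no missing values):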

print (df.query("gender == 'female' & major == 'Physics'").count())
gender    2
major     2
dtype: int64
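
To see how count() can differ from the number of rows, here is a minimal sketch with one missing value. The frame df_nan and its contents are invented for illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in the 'major' column
df_nan = pd.DataFrame({'gender': ['female', 'female', 'female'],
                       'major': ['Physics', np.nan, 'Math']})

# count() works column by column and skips missing values,
# so the two columns report different numbers
counts = df_nan.count()
print(counts)

# len() returns the number of rows, whether or not values are missing
n_rows = len(df_nan)
print(n_rows)
```

Dividing one count() result by another therefore yields a Series of per-column ratios rather than a single number.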

The correct approach is to take the number of rows with len:

print (len(df.query("gender == 'female' & major == 'Physics'")))
2

print (len(df[(df['gender']=='female') & (df['major']=='Physics')]))
2

Or count only the True values of the mask with sum:

print ((df['gender']=='female') & (df['major']=='Physics'))
0     True
1     True
2    False
3    False
4    False
dtype: bool

print (((df['gender']=='female') & (df['major']=='Physics')).sum())
2
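
Because True counts as 1 and False as 0, the same idea can give a proportion directly by averaging a mask. The sketch below is one possible shortcut, not something from the original answer; the names is_female, is_physics and share are illustrative.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'gender': ['female'] * 3 + ['male'] * 2,
                   'major': ['Physics'] * 2 + ['Math'] * 3})

is_female = df['gender'] == 'female'
is_physics = df['major'] == 'Physics'

# Keep the Physics mask for female rows only, then average the booleans:
# the mean of a boolean Series is the fraction of True values
share = is_physics[is_female].mean()
print(share)
```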

So, putting it all together:

mask1 = (df['gender']=='female')
mask2 = (df['major']=='Physics')
print ((mask1 & mask2).sum() / mask1.sum())
0.6666666666666666

df1 = df.query("gender == 'female' & major == 'Physics'")
df2 = df.query("gender=='female'")
print (len(df1) / len(df2))
0.6666666666666666

df1 = df[(df['gender']=='female') & (df['major']=='Physics')]
df2 = df[(df['gender']=='female')]
print (len(df1) / len(df2))
0.6666666666666666
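
The extra df['admitted'] condition in the question's second snippet also changes both the numerator and the denominator, which on its own can make the two results differ. Here is a small sketch of that effect. The frame df_adm and its 'admitted' values are made up for illustration, since the question's real data is not shown.

```python
import pandas as pd

# Hypothetical data with an 'admitted' column (values invented for illustration)
df_adm = pd.DataFrame({
    'gender': ['female'] * 4 + ['male'] * 2,
    'major': ['Physics', 'Physics', 'Math', 'Math', 'Math', 'Physics'],
    'admitted': [True, False, True, True, False, True],
})

female = df_adm['gender'] == 'female'
physics = df_adm['major'] == 'Physics'
admitted = df_adm['admitted']

# Share of Physics majors among all female students (the filter used in Code 1)
share_all = (female & physics).sum() / female.sum()

# Share of Physics majors among admitted female students (the filter used in Code 2)
share_admitted = (female & physics & admitted).sum() / (female & admitted).sum()

print(share_all)
print(share_admitted)
```

The two numbers answer different questions, so they only match if the same conditions are applied in both snippets.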