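I'm new to Python and pandas, and I'm trying to understand the difference between two pieces of code and why they do different things.
I tried splitting the code into separate lines, but the two still give different answers.
What proportion of female students major in Physics?
Code 1: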
fem_phy = (df.query("gender == 'female' & major == 'Physics'").count() /
           df.query("gender=='female'").count())
fem_phy
Code 2:
len(df[(df['gender'] == 'female') & (df['admitted']) &
(df['major']=='Physics')]) / len(df[(df['gender']=='female') &
(df['admitted'])])
I expected the second piece of code to return 0.120623, just like the first one.
Answer 0 (score: 0):
Check with some sample data:
import pandas as pd

# sample data: 3 female and 2 male students, 2 Physics and 3 Math majors
df = pd.DataFrame({'gender': ['female'] * 3 + ['male'] * 2,
                   'major': ['Physics'] * 2 + ['Math'] * 3})
print (df)
gender major
0 female Physics
1 female Physics
2 female Math
3 male Math
4 male Math
To filter rows correctly, use DataFrame.query or boolean indexing. To get the same output from both, the df['admitted'] condition from the second snippet is removed:
print (df.query("gender == 'female' & major == 'Physics'"))
gender major
0 female Physics
1 female Physics
print (df.query("gender=='female'"))
gender major
0 female Physics
1 female Physics
2 female Math
print (df[(df['gender']=='female') & (df['major']=='Physics')])
gender major
0 female Physics
1 female Physics
print (df[(df['gender']=='female')])
gender major
0 female Physics
1 female Physics
2 female Math
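The problem is DataFrame.count: it returns the number of non-missing values in each column, not a single row count. So here you get a Series in which every value is 2 (because the data contains no missing values):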
print (df.query("gender == 'female' & major == 'Physics'").count())
gender 2
major 2
dtype: int64
The correct approach is to use len to get the number of rows:
print (len(df.query("gender == 'female' & major == 'Physics'")))
2
print (len(df[(df['gender']=='female') & (df['major']=='Physics')]))
2
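To see why count() and len() can disagree, here is a minimal sketch using a variant of the sample data with one missing value. The name df_nan and the None entry are assumptions made for this demo and are not part of the question's data.

```python
import pandas as pd

# Hypothetical variant of the sample data: one female student has no major recorded
df_nan = pd.DataFrame({'gender': ['female'] * 3 + ['male'] * 2,
                       'major': ['Physics', None, 'Math', 'Math', 'Math']})

females = df_nan.query("gender == 'female'")

# count() returns a Series with the number of non-missing values per column
per_column = females.count()
print(per_column)

# len() returns one integer: the number of rows, missing values or not
n_rows = len(females)
print(n_rows)
```

Dividing two count() results therefore gives one ratio per column, while dividing two len() results gives a single number.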
Or count only the True values of the mask with sum:
print ((df['gender']=='female') & (df['major']=='Physics'))
0 True
1 True
2 False
3 False
4 False
dtype: bool
print (((df['gender']=='female') & (df['major']=='Physics')).sum())
2
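Since True counts as 1, the mean of a boolean mask is the fraction of True values, so the proportion can probably also be computed in one step. This is only a sketch of an alternative, not something the question requires. The names is_female and is_physics are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'gender': ['female'] * 3 + ['male'] * 2,
                   'major': ['Physics'] * 2 + ['Math'] * 3})

is_female = df['gender'] == 'female'
is_physics = df['major'] == 'Physics'

# Keep the Physics flag only for female rows, then average the booleans
share = is_physics[is_female].mean()
print(share)
```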
So, putting it all together:
mask1 = (df['gender']=='female')
mask2 = (df['major']=='Physics')
print ((mask1 & mask2).sum() / mask1.sum())
0.6666666666666666
df1 = df.query("gender == 'female' & major == 'Physics'")
df2 = df.query("gender=='female'")
print (len(df1) / len(df2))
0.6666666666666666
df1 = df[(df['gender']=='female') & (df['major']=='Physics')]
df2 = df[(df['gender']=='female')]
print (len(df1) / len(df2))
0.6666666666666666
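The question's second snippet also filters on df['admitted'], which is a likely reason its result differs from the first even once len is used. A minimal sketch, assuming a hypothetical boolean admitted column added to the sample data (the values are made up for illustration):

```python
import pandas as pd

# Sample data plus a made-up 'admitted' column (assumption for this demo)
df_adm = pd.DataFrame({'gender': ['female'] * 3 + ['male'] * 2,
                       'major': ['Physics'] * 2 + ['Math'] * 3,
                       'admitted': [True, False, True, False, True]})

female = df_adm['gender'] == 'female'
physics = df_adm['major'] == 'Physics'
admitted = df_adm['admitted']

# Share of all female students who major in Physics
ratio_all = (female & physics).sum() / female.sum()

# Share of admitted female students who major in Physics
ratio_admitted = (female & physics & admitted).sum() / (female & admitted).sum()

print(ratio_all, ratio_admitted)
```

The two ratios answer different questions, so they only match if the admitted filter is applied to both snippets or to neither.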