我正在尝试通过一组专业找到男女之间的工资差距。
这是我的桌子的文本版本:
gender field group logwage
0 male BUSINESS 7.229572
10 female BUSINESS 7.072464
1 male COMM/JOURN 7.108538
11 female COMM/JOURN 7.015018
2 male COMPSCI/STAT 7.340410
12 female COMPSCI/STAT 7.169401
3 male EDUCATION 6.888829
13 female EDUCATION 6.770255
4 male ENGINEERING 7.397082
14 female ENGINEERING 7.323996
5 male HUMANITIES 7.053048
15 female HUMANITIES 6.920830
6 male MEDICINE 7.319011
16 female MEDICINE 7.193518
17 female NATSCI 6.993337
7 male NATSCI 7.089232
18 female OTHER 6.881126
8 male OTHER 7.091698
9 male SOCSCI/PSYCH 7.197572
19 female SOCSCI/PSYCH 6.968322
diff不适用于我,因为它将占用每个连续专业之间的差异。
这是现在的代码:
for row in sorted_mfield:
if sorted_mfield['field group']==sorted_mfield['field group'].shift(1):
diff= lambda x: x[0]-x[1]
我的下一个策略是回到未排序的数据框,在该数据框中,男性和女性是自己的专栏,并从那开始有所作为,但是由于我花了一个小时才尝试这样做,所以对熊猫来说这是很新的,我以为我会问问一下,看看它是如何工作的。谢谢。
答案 0 :(得分:2)
我会考虑使用pivot
重塑您的DataFrame,使其更易于计算。
df.pivot(index='field group', columns='gender', values='logwage').rename_axis([None], axis=1)
# female male
#field group
#BUSINESS 7.072464 7.229572
#COMM/JOURN 7.015018 7.108538
#COMPSCI/STAT 7.169401 7.340410
#EDUCATION 6.770255 6.888829
#ENGINEERING 7.323996 7.397082
#HUMANITIES 6.920830 7.053048
#MEDICINE 7.193518 7.319011
#NATSCI 6.993337 7.089232
#OTHER 6.881126 7.091698
#SOCSCI/PSYCH 6.968322 7.197572
df.male - df.female
#field group
#BUSINESS 0.157108
#COMM/JOURN 0.093520
#COMPSCI/STAT 0.171009
#EDUCATION 0.118574
#ENGINEERING 0.073086
#HUMANITIES 0.132218
#MEDICINE 0.125493
#NATSCI 0.095895
#OTHER 0.210572
#SOCSCI/PSYCH 0.229250
#dtype: float64
答案 1 :(得分:2)
在数据的排序版本中使用Pandas.DataFrame.shift()的解决方案:
df.sort_values(by=['field group', 'gender'], inplace=True)
df['gap'] = df.logwage - df.logwage.shift(1)
df[df.gender =='male'][['field group', 'gap']]
使用示例数据生成以下输出:
field group gap
0 BUSINESS 0.157108
2 COMM/JOURN 0.093520
4 COMPSCI/STAT 0.171009
6 EDUCATION 0.118574
8 ENGINEERING 0.073086
10 HUMANITIES 0.132218
12 MEDICINE 0.125493
15 NATSCI 0.095895
17 OTHER 0.210572
18 SOCSCI/PSYCH 0.229250
注意:它认为每个字段组始终具有一对值。如果要验证它或消除不带该对的字段组,则下面的代码进行过滤:
df_grouped = df.groupby('field group')
df_filtered = df_grouped.filter(lambda x: len(x) == 2)