I have a data frame clg_df that looks like this:
prov  wave  clg_id  bar
11    2005  9       500
I have a function that takes four columns as input and runs against another data frame, test_df:
prov  wave  clg_id  st_id  score
11    2005  10      111    560
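For each prov-wave, I want to find the number of students whose score is higher than the bar defined by the prov-wave-clg_id combination.
The final result should look like this:
prov  wave  clg_id  bar  number
11    2005  9       500  40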
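To make the target concrete, here is a minimal sketch of one way to read it: pair each clg_df row with every test_df student from the same prov and wave, then count the scores above that row's bar. The sample values are invented for illustration, and the merge-based approach is an assumption about the intended result, not the code used in this question.

```python
import pandas as pd

# Invented sample data using the column names shown above
clg_df = pd.DataFrame({
    "prov": [11, 11],
    "wave": [2005, 2005],
    "clg_id": [9, 10],
    "bar": [500, 570],
})
test_df = pd.DataFrame({
    "prov": [11, 11, 11, 12],
    "wave": [2005, 2005, 2005, 2005],
    "clg_id": [10, 9, 9, 9],
    "st_id": [111, 112, 113, 114],
    "score": [560, 480, 610, 700],
})

# Pair every college row with every student of the same prov and wave
pairs = clg_df.merge(test_df[["prov", "wave", "score"]], on=["prov", "wave"], how="left")

# Flag the students who score above that college's bar
pairs["above"] = pairs["score"] > pairs["bar"]

# Count the flagged students per prov-wave-clg_id combination
counts = (
    pairs.groupby(["prov", "wave", "clg_id"])["above"]
    .sum()
    .rename("number")
    .reset_index()
)
result = clg_df.merge(counts, on=["prov", "wave", "clg_id"], how="left")
print(result)
```

A left merge keeps colleges that have no students in their prov-wave; they end up with a count of 0.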
I am currently using a loop to produce the desired output. Is it possible to use the apply function instead?
def gen_envy(clg, prov, year, test_df, clg_bar_df):
    # Select the row of clg_bar_df for the given clg-prov-year combination
    condition_1 = clg_bar_df['provid'] == prov
    condition_2 = clg_bar_df['wave'] == year
    condition_3 = clg_bar_df['clg_id'] == clg
    temp = clg_bar_df.loc[condition_1 & condition_2 & condition_3]
    # Bar associated with this clg-prov-year
    bar = temp['bar'].values[0]
    # Select the rows of test_df (gaokao_bar_df in the loop below) for the
    # given prov-year combination
    condition_4 = test_df['provid'] == prov
    condition_5 = test_df['wave'] == year
    temp2 = test_df.loc[condition_4 & condition_5]
    # Flag the rows of temp2 that satisfy both conditions:
    # 1. Own score is higher than the bar
    # 2. Enrolled in a school whose cutoff is lower than the bar
    condition_6 = temp2['score'] > bar
    condition_7 = temp2['bar'] < bar
    x = condition_6 & condition_7
    # Return the fraction of envy
    return x.mean()
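I call the function in a loop:
for i in range(len(clg_bar_df)):
    clg = clg_bar_df['clg_id'].iloc[i]
    prov = clg_bar_df['provid'].iloc[i]
    year = clg_bar_df['wave'].iloc[i]
    # Assign through a single .loc call; chained ['envy'].iloc[i] = ... may write to a copy
    clg_bar_df.loc[clg_bar_df.index[i], 'envy'] = gen_envy(clg, prov, year, gaokao_bar_df, clg_bar_df)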
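As a rough sketch of what the apply version might look like, the loop above can be written as a row-wise DataFrame.apply with axis=1. The gen_envy here is a compact restatement of the function above. The sample frames are invented, and they assume gaokao_bar_df has the provid, wave, score and bar columns that the function reads.

```python
import pandas as pd

def gen_envy(clg, prov, year, test_df, clg_bar_df):
    # Bar of the given clg-prov-year combination
    is_clg = (
        (clg_bar_df["provid"] == prov)
        & (clg_bar_df["wave"] == year)
        & (clg_bar_df["clg_id"] == clg)
    )
    bar = clg_bar_df.loc[is_clg, "bar"].values[0]
    # Students of the same prov and year
    students = test_df.loc[(test_df["provid"] == prov) & (test_df["wave"] == year)]
    # Fraction with a score above the bar who enrolled in a school with a lower cutoff
    return ((students["score"] > bar) & (students["bar"] < bar)).mean()

# Invented sample data for illustration
clg_bar_df = pd.DataFrame({
    "provid": [11, 11],
    "wave": [2005, 2005],
    "clg_id": [9, 10],
    "bar": [500, 570],
})
gaokao_bar_df = pd.DataFrame({
    "provid": [11, 11, 11, 11],
    "wave": [2005, 2005, 2005, 2005],
    "score": [560, 480, 610, 520],
    "bar": [480, 450, 570, 510],
})

# One call per row; the returned Series becomes the new 'envy' column
clg_bar_df["envy"] = clg_bar_df.apply(
    lambda row: gen_envy(row["clg_id"], row["provid"], row["wave"], gaokao_bar_df, clg_bar_df),
    axis=1,
)
print(clg_bar_df)
```

With axis=1, apply still calls the function once per row in Python, so this mostly tidies the code rather than speeding it up; a merge on provid and wave followed by a groupby would avoid the per-row filtering altogether.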