job Education Age Number of relatives
1 1 25 5
1 2 23 20
3 4 26 50
2 1 37 100
4 3 29 34
output Job Education agemin agemax relativesmin relativesmax
Category1 1 1 25 34 1 11
Category2 2 3 35 44 11 50
Category3 3 2 45 100 50 200
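So the question is how to add an output column to the first dataset based on conditions (df1.job == df2.Job ... and Age falls between agemin and agemax in the second dataset). The output should look like this: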
job Education Age Number of relatives output
1 1 25 5 Category1
1 2 23 20 Category2
3 4 26 50 Uncategorized
2 1 37 100 ....
4 3 29 34 ....
I have tried several ways of iterating over and joining the two datasets, but I have not gotten the result I need.
Answer 0 (score: 2)
IIUC,
We can merge, then use a simple filter with a column assignment:
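# Lower-case df2's column names so that "Job" matches df1's "job" merge key.
df2.columns = df2.columns.str.lower()
# A left merge keeps every row of df1 and attaches the age bounds and label for its job.
df_new = pd.merge(df1, df2[["job", "agemin", "agemax", "output"]], on="job", how="left")
# Overwrite the label on rows whose Age is not within [agemin, agemax].
df_new.loc[
    ~((df_new["Age"] >= df_new["agemin"]) & (df_new["Age"] <= df_new["agemax"])), "output"
] = "Uncategorised"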
print(df_new)
job Education Age Number_of_relatives agemin agemax output
0 1 1 25 5 25.0 34.0 Category1
1 1 2 23 20 25.0 34.0 Uncategorised
2 3 4 26 50 45.0 100.0 Uncategorised
3 2 1 37 100 35.0 44.0 Category2
4 4 3 29 34 NaN NaN Uncategorised
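For reference, here is a minimal self-contained sketch of the same merge-then-filter idea. The frames are rebuilt from the question's sample data (with df2's columns already lower-cased and its unused columns left out, which is an assumption made for brevity), and the names `merged`, `in_range`, and `result` are illustrative only. It uses `Series.between` for the range check and drops the helper columns at the end.

```python
import pandas as pd

# Sample data taken from the question; only the columns needed here are kept in df2.
df1 = pd.DataFrame({
    "job": [1, 1, 3, 2, 4],
    "Education": [1, 2, 4, 1, 3],
    "Age": [25, 23, 26, 37, 29],
    "Number_of_relatives": [5, 20, 50, 100, 34],
})
df2 = pd.DataFrame({
    "output": ["Category1", "Category2", "Category3"],
    "job": [1, 2, 3],
    "agemin": [25, 35, 45],
    "agemax": [34, 44, 100],
})

# Left merge keeps every row of df1; jobs missing from df2 get NaN bounds.
merged = df1.merge(df2, on="job", how="left")

# between() is inclusive on both ends by default and evaluates to False for NaN bounds.
in_range = merged["Age"].between(merged["agemin"], merged["agemax"])

# Keep the category only where the age fits; every other row gets the fallback label.
merged["output"] = merged["output"].where(in_range, "Uncategorised")

# Drop the helper columns that were only needed for the comparison.
result = merged.drop(columns=["agemin", "agemax"])
print(result)
```

This sketch assumes df2 has at most one row per job. If a job had several age ranges, the merge would produce one row per range and the result would need to be reduced back to one row per person.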
Answer 1 (score: 2)
Here is a way that combines IntervalIndex.from_arrays with reindex and assign:
s = pd.IntervalIndex.from_arrays(df2['agemin'], df2['agemax'], closed='left')
d = (df2.set_index(s).reindex(df1['Age']).loc[:, ['output', 'Job']]
     .groupby(level=0, sort=False).first().set_index('Job', append=True))
final = (df1.set_index(['Age', 'job']).assign(**d)
         .fillna({'output': 'Uncategorized'}).reset_index())
print(final)
Age job Education Number_of_relatives output
0 25 1 1 5 Category1
1 23 1 2 20 Uncategorized
2 26 3 4 50 Uncategorized
3 37 2 1 100 Category2
4 29 4 3 34 Uncategorized
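To make the interval lookup easier to follow, here is a hedged step-by-step sketch of the same idea using `IntervalIndex.get_indexer` and `numpy.where` instead of the reindex/assign chain. It is a substitute formulation, not the answer's own code. The frames are rebuilt from the question's sample data, the names `bins`, `pos`, `found`, and `same_job` are illustrative, and the sketch assumes the age ranges in df2 do not overlap.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question; only the columns needed here are kept in df2.
df1 = pd.DataFrame({
    "job": [1, 1, 3, 2, 4],
    "Education": [1, 2, 4, 1, 3],
    "Age": [25, 23, 26, 37, 29],
    "Number_of_relatives": [5, 20, 50, 100, 34],
})
df2 = pd.DataFrame({
    "output": ["Category1", "Category2", "Category3"],
    "Job": [1, 2, 3],
    "agemin": [25, 35, 45],
    "agemax": [34, 44, 100],
})

# One interval per df2 row, closed on the left: [agemin, agemax).
bins = pd.IntervalIndex.from_arrays(df2["agemin"], df2["agemax"], closed="left")

# Position of the interval containing each age, or -1 when no interval contains it.
pos = bins.get_indexer(df1["Age"].to_numpy())
found = pos >= 0

# The job attached to the matched interval must equal the person's job.
# (Where pos is -1 the lookup picks an arbitrary row, but `found` masks it out.)
same_job = df2["Job"].to_numpy()[pos] == df1["job"].to_numpy()

final = df1.assign(
    output=np.where(found & same_job, df2["output"].to_numpy()[pos], "Uncategorized")
)
print(final)
```

Note that with `closed="left"` an age equal to `agemax` falls outside its interval, whereas the merge-based answer treats both bounds as inclusive. Pass `closed="both"` if the upper bound should count as a match.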