I have two DataFrames,
df1,
   Name    Stage  Description                                key
0  Sri     1      Sri is one of the good singer in this two  one
1  NaN     2      Thanks for reading                         two has
2  Ram     1      Ram is two of the good cricket player      three
3  ganesh  1      one driver                                 four
4  NaN     2      good buddies                               NaN
df2,
values
member of four
one of three friends
sri is a cricketer
Rahul has two brothers
If a key is contained in one of the df2["values"] strings, I want to replace df1["key"] with that df2 value.
I tried, df1["key"]=df2[df2["values"].str.contains("|".join(df2["values"].tolist()),na=False)]
but my output came out in the same order as df2.
What I want is,
output_df,
   Name    Stage  Description                                key
0  Sri     1      Sri is one of the good singer in this two  one of three friends
1  NaN     2      Thanks for reading                         Rahul has two brothers
2  Ram     1      Ram is two of the good cricket player      one of three friends
3  ganesh  1      one driver                                 member of four
4  NaN     2      good buddies                               NaN
Answer 0 (score: 2)
I would use arrays of sets, with <= for the subset test, together with numpy broadcasting.
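Before the full solution, here is a minimal sketch of the two ingredients on their own, in case they are unfamiliar. The toy sets `keys` and `vals` are made up for illustration and are not the question's data. For Python sets, `a <= b` is true when every element of `a` is in `b`, and indexing with `[:, None]` turns a 1-D array into a column so that the comparison is made for every (key, value) pair.

```python
import numpy as np

# Hypothetical toy data, only to illustrate the mechanics.
keys = np.array([{"one"}, {"two", "has"}, {"zzz"}], dtype=object)
vals = np.array(
    [{"member", "of", "four"}, {"one", "of", "three"}, {"has", "two"}],
    dtype=object,
)

# keys[:, None] has shape (3, 1) and vals has shape (3,), so the
# comparison broadcasts to shape (3, 3): row r, column c holds
# keys[r] <= vals[c], i.e. "is keys[r] a subset of vals[c]?"
matches = (keys[:, None] <= vals).astype(bool)

has_match = matches.any(axis=1)       # does the row match anything?
first_match = matches.argmax(axis=1)  # position of the first True per row

print(matches)
print(has_match, first_match)
```

Note that `argmax` returns 0 for a row with no match at all, which is why a separate `any` check is needed to tell "matched column 0" apart from "matched nothing".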
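import numpy as np
import pandas as pd

# Split a string on whitespace and return its words as a set
setify = lambda x: set(x.split())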
v = df2['values'].values.astype(str)
k = df1['key'].values.astype(str)
i = df1.index
# These are the sets
a = np.array([setify(x) for x in k.tolist()])
b = np.array([setify(x) for x in v.tolist()])
# This is the broadcasting
matches = (a[:, None] <= b)
# Check that each row has at least one match
any_ = matches.any(1)
# Check that the key wasn't null in the first place
nul_ = df1['key'].notnull().values
mask = any_ & nul_
# Use argmax to find where the first set match is. There
# may be more than one match. I chose to use `assign`,
# so I index the Series with `mask` to pass only the rows
# that actually matched; the other rows become NaN.
df1.assign(key1=pd.Series(v[matches.argmax(1)], i)[mask])
   Name    Stage  Description                                key      key1
0  Sri     1      Sri is one of the good singer in this two  one      one of three friends
1  NaN     2      Thanks for reading                         two has  Rahul has two brothers
2  Ram     1      Ram is two of the good cricket player      three    one of three friends
3  ganesh  1      one driver                                 four     member of four
4  NaN     2      good buddies                               NaN      NaN
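The result above puts the matches in a new `key1` column, while the question asks for `key` itself to be replaced. Below is a self-contained sketch of one way to do that with the same subset-and-broadcast approach. It rebuilds the two frames from the question and uses `Series.mask` to overwrite only the matched rows. Keeping the original key when nothing matches is my assumption; the question does not say what should happen in that case.

```python
import numpy as np
import pandas as pd

# Data rebuilt from the question.
df1 = pd.DataFrame({
    "Name": ["Sri", np.nan, "Ram", "ganesh", np.nan],
    "Stage": [1, 2, 1, 1, 2],
    "Description": [
        "Sri is one of the good singer in this two",
        "Thanks for reading",
        "Ram is two of the good cricket player",
        "one driver",
        "good buddies",
    ],
    "key": ["one", "two has", "three", "four", np.nan],
})
df2 = pd.DataFrame({"values": [
    "member of four",
    "one of three friends",
    "sri is a cricketer",
    "Rahul has two brothers",
]})

v = df2["values"].to_numpy().astype(str)
k = df1["key"].to_numpy().astype(str)  # NaN becomes the string 'nan'

# One set of words per key and per value.
a = np.array([set(x.split()) for x in k.tolist()], dtype=object)
b = np.array([set(x.split()) for x in v.tolist()], dtype=object)

# matches[r, c] is True when every word of key r appears in value c.
matches = (a[:, None] <= b).astype(bool)

# Only rows that have a match and whose key was not null to begin with.
mask = matches.any(axis=1) & df1["key"].notna().to_numpy()

# First matching value for every row (meaningless where mask is False).
matched = pd.Series(v[matches.argmax(axis=1)], index=df1.index)

# Overwrite `key` where mask is True; keep the original key elsewhere
# (an assumption, see the note above).
out = df1.assign(key=df1["key"].mask(mask, matched))
print(out)
```

The `notna` check matters because a null key is turned into the word 'nan' by `astype(str)`, which would otherwise match any value that happens to contain that word.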