I have two DataFrames, as follows:
import pandas as pd

data = {
    'Name': ['Drama', 'Tennis Elbow', 'Cricket & bat', 'Ant and Boat'],
    'Items': ['abc, def, kgf, do work', 'ball, jig, file code, sensor dye, gun',
              'jack and jill, common, bitter', 'ram, krish, myran']
}
df1 = pd.DataFrame(data)
df1
            Name                                  Items
0          Drama                 abc, def, kgf, do work
1   Tennis Elbow  ball, jig, file code, sensor dye, gun
2  Cricket & bat          jack and jill, common, bitter
3   Ant and Boat                      ram, krish, myran
and
data2 = {
    'values': ['abc and sea', 'def work', 'abc', 'ram cold', 'myran add', 'check'],
    'gems': ['A1, A2, A3, A4', 'B1, A1, B2, B3', 'C1, A1', 'KS, KM', 'JP, CVK', 'KF, GF']
}
df2 = pd.DataFrame(data2)
df2
        values            gems
0  abc and sea  A1, A2, A3, A4
1     def work  B1, A1, B2, B3
2          abc          C1, A1
3     ram cold          KS, KM
4    myran add         JP, CVK
5        check          KF, GF
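I want to match the items in df1['Items'] (either as whole strings or as parts of a string) against df2['values'], and create a new DataFrame in which the matching Name appears in a new column, like this: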
        values            gems          Name
0  abc and sea  A1, A2, A3, A4         Drama
1     def work  B1, A1, B2, B3         Drama
2          abc          C1, A1         Drama
3     ram cold          KS, KM  Ant and Boat
4    myran add         JP, CVK  Ant and Boat
Answer 0: (score: 2)
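One approach is to build a mapping dictionary from df1 and use it to map the values in df2.
Split df1["Items"] into individual items and explode the resulting list column, so that each item gets its own entry in the mapper:
df1["Items"] = df1["Items"].str.split(", ")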
mapper = df1.explode("Items")
mapper = dict(zip(mapper["Items"], mapper["Name"]))
df2["Name"] = df2["values"].apply(lambda x: " ".join([mapper.get(word,"") for word in x.split()]).strip())
df2 = df2[df2["Name"]!=""]
>>> df2
        values            gems          Name
0  abc and sea  A1, A2, A3, A4         Drama
1     def work  B1, A1, B2, B3         Drama
2          abc          C1, A1         Drama
3     ram cold          KS, KM  Ant and Boat
4    myran add         JP, CVK  Ant and Boat
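For reference, the dictionary approach can be sketched end to end as below. This is a variation rather than the answer's exact code: it leaves df1 unmodified, and the helper `lookup_names` is a hypothetical name introduced here so that a value whose words hit the same Name more than once yields that Name only once.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Drama", "Tennis Elbow", "Cricket & bat", "Ant and Boat"],
    "Items": [
        "abc, def, kgf, do work",
        "ball, jig, file code, sensor dye, gun",
        "jack and jill, common, bitter",
        "ram, krish, myran",
    ],
})
df2 = pd.DataFrame({
    "values": ["abc and sea", "def work", "abc", "ram cold", "myran add", "check"],
    "gems": ["A1, A2, A3, A4", "B1, A1, B2, B3", "C1, A1", "KS, KM", "JP, CVK", "KF, GF"],
})

# One row per item, then an item -> Name dictionary.
exploded = df1.assign(Items=df1["Items"].str.split(", ")).explode("Items")
mapper = dict(zip(exploded["Items"], exploded["Name"]))


def lookup_names(text):
    """Hypothetical helper: Names matched by any word of `text`, without repeats."""
    names = []
    for word in text.split():
        name = mapper.get(word)
        if name is not None and name not in names:
            names.append(name)
    return " ".join(names)


result = df2.assign(Name=df2["values"].map(lookup_names))
# Keep only the rows where at least one word matched.
result = result[result["Name"] != ""].reset_index(drop=True)
print(result)
```

Because the values are split on whitespace, only single-word items such as "abc" or "ram" can match; multi-word items such as "do work" are never looked up as a unit.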
Answer 1: (score: 1)
First split the Items column on commas, strip any leftover whitespace, then explode and reset the index:
>>> df1['Items'] = df1['Items'].str.split(',').apply(lambda x:[i.strip() for i in x])
>>> df1 = df1.explode('Items').reset_index(drop=True)
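To see what this step produces, here is a small sketch on a two-row subset of df1. It is an equivalent variation, not the answer's code: it strips after exploding with `.str.strip()` instead of using a list comprehension, and `long_df1` is a name chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Drama", "Ant and Boat"],
    "Items": ["abc, def, kgf, do work", "ram, krish, myran"],
})

# Split on commas, give every item its own row, then trim the spaces.
long_df1 = df1.assign(Items=df1["Items"].str.split(",")).explode("Items")
long_df1["Items"] = long_df1["Items"].str.strip()
long_df1 = long_df1.reset_index(drop=True)
print(long_df1)
```

Each item now sits on its own row next to the Name it came from, which is the shape the lookup in the next step relies on.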
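Then write a function that returns a Name or NaN depending on whether x contains any value from the Items column: if it does, return the first matching Name, otherwise return NaN.
>>> import numpy as np
>>> def getName(x):
...     return next(iter(df1.loc[df1['Items'].apply(lambda item: item in x)]['Name']),
...                 np.nan)
Finally, apply getName to the values column of the second DataFrame, assign the result to a new column Name, and drop the rows where Name is NaN.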
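The final step is described in words only, so the following is a sketch of how the whole answer might fit together, not the answerer's own code. The names `items`, `get_name` and `result` are assumptions made here, and `dropna(subset=["Name"])` is one reasonable way to drop the unmatched rows.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Drama", "Tennis Elbow", "Cricket & bat", "Ant and Boat"],
    "Items": [
        "abc, def, kgf, do work",
        "ball, jig, file code, sensor dye, gun",
        "jack and jill, common, bitter",
        "ram, krish, myran",
    ],
})
df2 = pd.DataFrame({
    "values": ["abc and sea", "def work", "abc", "ram cold", "myran add", "check"],
    "gems": ["A1, A2, A3, A4", "B1, A1, B2, B3", "C1, A1", "KS, KM", "JP, CVK", "KF, GF"],
})

# One stripped item per row, keeping the Name it belongs to.
items = df1.assign(Items=df1["Items"].str.split(",")).explode("Items")
items["Items"] = items["Items"].str.strip()
items = items.reset_index(drop=True)


def get_name(x):
    """Return the Name of the first item that occurs in x, or NaN if none does."""
    matches = items.loc[items["Items"].apply(lambda item: item in x), "Name"]
    return next(iter(matches), np.nan)


df2["Name"] = df2["values"].apply(get_name)
# Rows with no matching item carry NaN in Name and are dropped here.
result = df2.dropna(subset=["Name"]).reset_index(drop=True)
print(result)
```

Note that `item in x` is a substring test, so an item such as "ram" would also match a value like "program"; if only whole words should match, comparing against `x.split()` would be a stricter alternative.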