Question

I have two data frames as follows:

import pandas as pd

data = {
    'Name': ['Drama', 'Tennis Elbow', 'Cricket & bat', 'Ant and Boat'],
    'Items': ['abc, def, kgf, do work', 'ball, jig, file code, sensor dye, gun', 'jack and jill, common, bitter', 
             'ram, krish, myran']
}
df1 = pd.DataFrame(data)

df1

    Name            Items
0   Drama           abc, def, kgf, do work
1   Tennis Elbow    ball, jig, file code, sensor dye, gun
2   Cricket & bat   jack and jill, common, bitter
3   Ant and Boat    ram, krish, myran

and

data2 = {
    'values': ['abc and sea', 'def work', 'abc', 'ram cold', 'myran add', 'check'],
    'gems': ['A1, A2, A3, A4', 'B1, A1, B2, B3', 'C1, A1', 'KS, KM', 'JP, CVK', 'KF, GF']  
}
df2 = pd.DataFrame(data2)

df2

    values        gems
0   abc and sea   A1, A2, A3, A4
1   def work      B1, A1, B2, B3
2   abc           C1, A1
3   ram cold      KS, KM
4   myran add     JP, CVK
5   check         KF, GF

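I want to match the items in df1['Items'] against the strings in df2['values'] (a row matches when its string contains one of the items) and build a new data frame with the matching Name in a new column, like this: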

    values        gems              Name
0   abc and sea   A1, A2, A3, A4    Drama
1   def work      B1, A1, B2, B3    Drama
2   abc           C1, A1            Drama
3   ram cold      KS, KM            Ant and Boat
4   myran add     JP, CVK           Ant and Boat

Answer 1

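One approach is to build a mapping dictionary from df1 and use it to map the values in df2.

Split df1["Items"] into individual items and explode the resulting list column to create a mapper with one entry per item: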

df1["Items"] = df1["Items"].str.split(", ")
mapper = df1.explode("Items")
mapper = dict(zip(mapper["Items"], mapper["Name"]))
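
To make the shape of the mapper concrete, here is a small self-contained sketch that rebuilds it from the question's df1. The name `exploded` is only illustrative, and `assign` is used so that df1 itself is not modified, which differs slightly from the snippet above.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Drama", "Tennis Elbow", "Cricket & bat", "Ant and Boat"],
    "Items": [
        "abc, def, kgf, do work",
        "ball, jig, file code, sensor dye, gun",
        "jack and jill, common, bitter",
        "ram, krish, myran",
    ],
})

# Split each Items string into a list, then give every item its own row.
# assign() returns a new frame, so df1 keeps its comma-separated strings.
exploded = df1.assign(Items=df1["Items"].str.split(", ")).explode("Items")

# One dictionary entry per item, pointing at the Name it came from.
mapper = dict(zip(exploded["Items"], exploded["Name"]))

print(mapper["abc"])      # Drama
print(mapper["do work"])  # Drama
print(len(mapper))        # 15
```

Note that multi-word items such as "do work" become single keys, so a lookup that goes word by word will never hit them.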

Use the mapper to look up the Name for each word in df2["values"], then drop the rows with no match:

df2["Name"] = df2["values"].apply(lambda x: " ".join([mapper.get(word,"") for word in x.split()]).strip())
df2 = df2[df2["Name"]!=""]

Output:

>>> df2
        values            gems          Name
0  abc and sea  A1, A2, A3, A4         Drama
1     def work  B1, A1, B2, B3         Drama
2          abc          C1, A1         Drama
3     ram cold          KS, KM  Ant and Boat
4    myran add         JP, CVK  Ant and Boat
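
One caveat with the join-based lookup: if two words in the same string are both in the mapper, the names are concatenated (for example "abc def" would give "Drama Drama"). A possible variation, offered only as a sketch, keeps the first match instead. The helper name `first_match` is hypothetical and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Drama", "Tennis Elbow", "Cricket & bat", "Ant and Boat"],
    "Items": [
        "abc, def, kgf, do work",
        "ball, jig, file code, sensor dye, gun",
        "jack and jill, common, bitter",
        "ram, krish, myran",
    ],
})
df2 = pd.DataFrame({
    "values": ["abc and sea", "def work", "abc", "ram cold", "myran add", "check"],
    "gems": ["A1, A2, A3, A4", "B1, A1, B2, B3", "C1, A1", "KS, KM", "JP, CVK", "KF, GF"],
})

# Same item -> Name dictionary as in the answer, built without mutating df1.
exploded = df1.assign(Items=df1["Items"].str.split(", ")).explode("Items")
mapper = dict(zip(exploded["Items"], exploded["Name"]))


def first_match(text):
    """Return the Name for the first whitespace-separated word found in mapper, else None."""
    return next((mapper[word] for word in text.split() if word in mapper), None)


# Rows whose string has no matching word get None and are dropped.
result = df2.assign(Name=df2["values"].map(first_match)).dropna(subset=["Name"])
print(result)
```

Like the original, this still compares whole words only, so multi-word items such as "do work" are never matched.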

Answer 2

First, split the Items column on commas, strip any remaining whitespace, then explode and reset the index:

>>> df1['Items'] = df1['Items'].str.split(',').apply(lambda x:[i.strip() for i in x])
>>> df1 = df1.explode('Items').reset_index(drop=True)

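Then write a function that returns either a Name or NaN: if x contains a value from the Items column, return the Name of the first match; otherwise return NaN.

>>> import numpy as np
>>> def getName(x):
...     return next(iter(df1.loc[df1['Items'].apply(lambda item: item in x)]['Name']), np.nan)

Finally, apply the function to the values column of the second data frame, assign the result to a new column Name, and drop the rows where Name is NaN.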
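
The second approach can be sketched end to end as follows. The final step is described only in prose in this answer, so the `apply` plus `dropna` call here is an assumption about what was intended, and the names `exploded` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Drama", "Tennis Elbow", "Cricket & bat", "Ant and Boat"],
    "Items": [
        "abc, def, kgf, do work",
        "ball, jig, file code, sensor dye, gun",
        "jack and jill, common, bitter",
        "ram, krish, myran",
    ],
})
df2 = pd.DataFrame({
    "values": ["abc and sea", "def work", "abc", "ram cold", "myran add", "check"],
    "gems": ["A1, A2, A3, A4", "B1, A1, B2, B3", "C1, A1", "KS, KM", "JP, CVK", "KF, GF"],
})

# Split on commas, give each item its own row, then strip stray whitespace.
exploded = df1.assign(Items=df1["Items"].str.split(",")).explode("Items")
exploded = exploded.reset_index(drop=True)
exploded["Items"] = exploded["Items"].str.strip()


def getName(x):
    """Return the Name of the first item that occurs as a substring of x, else NaN."""
    matches = exploded.loc[exploded["Items"].apply(lambda item: item in x), "Name"]
    return next(iter(matches), np.nan)


# Look up a Name for every row of df2 and drop the rows without a match.
result = df2.assign(Name=df2["values"].apply(getName)).dropna(subset=["Name"])
print(result)
```

Because `item in x` is a substring test, this version also matches multi-word items such as "do work", but it can match inside longer words as well (an item "ram" would match a value containing "program").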
