How to map two dataframes and output the overlapping items in new columns?

Asked: 2021-07-29 14:53:47

Tags: python pandas dataframe

I have two dataframes:

import pandas as pd

data = {
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1', 'KWS, KQM', 'JP, CVK', 'KF, GF']
}
df1 = pd.DataFrame(data)

df1

    values       gems
0   Cricket      A1K, A2M, JA3, AN4
1   Soccer       B1, A1, Bn2, B3
2   Football     CD1, A1
3   Tennis       KWS, KQM
4   Badminton    JP, CVK
5   Chess        KF, GF

The second dataframe:

data2 = {
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3','ji', 'HU']
}
df2 = pd.DataFrame(data2)

df2

    1C  02C 34C
0   B1  Bn2 KF
1   K1  B3  CD1
2   A1K JK  B3
3   J1  ZZ  ji
4   A4  ko  HU

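For each column of df2, I want to check which of its items appear in df1['gems'], and report both the count and the overlapping items. The expected output is: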

    values    gems                  1C  1CGroup   02C   02CGroup    34C 34CGroup
0   Cricket   A1K, A2M, JA3, AN4    1   A1K       0     NA          0   NA
1   Soccer    B1, A1, Bn2, B3       1   B1        2     Bn2, B3     1   B3
2   Football  CD1, A1               0   NA        0     NA          1   CD1
3   Tennis    KWS, KQM              0   NA        0     NA          0   NA
4   Badminton JP, CVK               0   NA        0     NA          0   NA
5   Chess     KF, GF                0   NA        0     NA          1   KF

4 answers:

Answer 0 (score: 7)

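First, str.split and explode the column gems, then reset_index to keep the original index. Then, for each column of df2, merge it with the exploded gems, groupby the original index, and perform the two aggregations needed: count, and join for the matched items. pd.concat the per-column results and join them to the original df1. Finally, fillna the count columns with 0, as in the expected output.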

# one row per gem used in the merge
df_ = df1['gems'].str.split(', ').explode().reset_index()

res = (
    df1.join( #can join to df1 as we keep the original index value
        pd.concat([df_.merge(df2[[col]], left_on='gems', right_on=col)
                      .groupby('index') # original index in df1
                      [col].agg(**{col: 'count', # do each aggregation
                                   f'{col}Group':lambda x: ', '.join(x)}) 
                   for col in df2.columns], # do it for each column of df2
                  axis=1))
        .fillna({col:0 for col in df2.columns}) #fill the count columns with 0
)
print(res)
      values                gems   1C 1CGroup  02C 02CGroup  34C 34CGroup
0    Cricket  A1K, A2M, JA3, AN4  1.0     A1K  0.0      NaN  0.0      NaN
1     Soccer     B1, A1, Bn2, B3  1.0      B1  2.0  Bn2, B3  1.0       B3
2   Football             CD1, A1  0.0     NaN  0.0      NaN  1.0      CD1
3     Tennis            KWS, KQM  0.0     NaN  0.0      NaN  0.0      NaN
4  Badminton             JP, CVK  0.0     NaN  0.0      NaN  0.0      NaN
5      Chess              KF, GF  0.0     NaN  0.0      NaN  1.0       KF
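To make the merge step easier to follow, here is a minimal sketch of the intermediate frames for a single column ('34C'), using the question's data. The names `exploded`, `matched` and `per_row` are mine, not from the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

# One row per gem; the 'index' column remembers the originating row of df1.
exploded = df1['gems'].str.split(', ').explode().reset_index()

# The default inner merge keeps only gems that also occur in df2['34C'].
matched = exploded.merge(df2[['34C']], left_on='gems', right_on='34C')

# Count and join the matches per original row of df1 (illustrative column names).
per_row = matched.groupby('index')['34C'].agg(count='count', items=', '.join)
print(per_row)
```

Rows of df1 without any match are absent from `per_row`, which is why the join back to df1 introduces NaN and the counts print as floats. If integer counts are wanted, appending `.astype({col: int for col in df2.columns})` after the `fillna` should restore them.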

Answer 1 (score: 5)

First, build a table for your groups:

df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
               )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=list)
      )

Output:

level_0        02C     1C    34C
index                           
0              NaN  [A1K]    NaN
1        [Bn2, B3]   [B1]   [B3]
2              NaN    NaN  [CD1]
5              NaN    NaN   [KF]
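The right-hand side of the merge, `df2.unstack().reset_index(level=0)`, is a long lookup table with one row per (column name, value) pair. The sketch below prints it and compares it with `df2.melt`, which I believe builds the same table with explicit column names; the names `lookup`, `melted`, `group` and `gem` are my own choices.

```python
import pandas as pd

df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

# Long format: the column 'level_0' holds the df2 column name, the column 0 holds the value.
lookup = df2.unstack().reset_index(level=0)
print(lookup.head())

# Same content with readable column names (illustrative names).
melted = df2.melt(var_name='group', value_name='gem')
print(melted.head())
```

With the melted form, the merge would use `right_on='gem'` and the pivot would use `columns=['group']` instead of `0` and `level_0`.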

Then generate the counts and concatenate everything with the original table:

pd.concat([df1,
           pd.concat([df3.add_suffix('Group').applymap(lambda x: ', '.join(x) if isinstance(x, list) else x),
                      df3.fillna('').applymap(len)],
                     axis=1).sort_index(axis=1)
          ], axis=1)

Output:

      values                gems  02C 02CGroup   1C 1CGroup  34C 34CGroup
0    Cricket  A1K, A2M, JA3, AN4  0.0      NaN  1.0     A1K  0.0      NaN
1     Soccer     B1, A1, Bn2, B3  2.0  Bn2, B3  1.0      B1  1.0       B3
2   Football             CD1, A1  0.0      NaN  0.0     NaN  1.0      CD1
3     Tennis            KWS, KQM  NaN      NaN  NaN     NaN  NaN      NaN
4  Badminton             JP, CVK  NaN      NaN  NaN     NaN  NaN      NaN
5      Chess              KF, GF  0.0      NaN  0.0     NaN  1.0       KF

Edit: an alternative for the string joining and counting

df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
               )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=', '.join)
      )

pd.concat([df1,
           pd.concat([df3.add_suffix('Group'),
                      df3.applymap(lambda x: x.count(',')+1 if isinstance(x, str) else 0)],
                     axis=1).sort_index(axis=1)
          ], axis=1)

Answer 2 (score: 5)

A solution with findall

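For each column in df2, find all occurrences of that column's values in the gems column of df1, then map with len to count the occurrences and, optionally, join the matches with str.join.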

for c in df2.columns:
    s = df1['gems'].str.findall('|'.join(df2[c]))

    df1[c] = s.map(len)
    df1[c + 'group'] = s.str.join(', ')

print(df1)

      values                gems  1C 1Cgroup  02C 02Cgroup  34C 34Cgroup
0    Cricket  A1K, A2M, JA3, AN4   1     A1K    0             0         
1     Soccer     B1, A1, Bn2, B3   1      B1    2  Bn2, B3    1       B3
2   Football             CD1, A1   0            0             1      CD1
3     Tennis            KWS, KQM   0            0             0         
4  Badminton             JP, CVK   0            0             0         
5      Chess              KF, GF   0            0             1       KF
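One caveat with findall: the pattern `'|'.join(df2[c])` matches substrings, so a value such as B1 would also be found inside a longer gem such as AB12, and values containing regex metacharacters would be misread. Neither case occurs in the question's data. A more defensive sketch escapes each value and anchors it on word boundaries; the helper name `build_pattern` and the sample strings are hypothetical, chosen only to show the difference.

```python
import re
import pandas as pd

def build_pattern(values):
    """Alternation that matches any of `values` only as a whole token.

    Hypothetical helper, not part of the answer above.
    """
    return r'\b(?:' + '|'.join(re.escape(v) for v in values) + r')\b'

# Made-up sample: 'AB12' contains 'B1' as a substring.
gems = pd.Series(['AB12, B1', 'B1, A1, Bn2, B3'])
column = ['B1', 'Bn2']

naive = gems.str.findall('|'.join(column))        # substring matches
strict = gems.str.findall(build_pattern(column))  # whole-token matches

print(naive.tolist())   # [['B1', 'B1'], ['B1', 'Bn2']]
print(strict.tolist())  # [['B1'], ['B1', 'Bn2']]
```

The `\b` anchors assume that every value starts and ends with a letter, digit or underscore; for values with other leading or trailing characters, splitting on the separator and comparing whole items is probably the safer route.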

Answer 3 (score: 3)

The worst solution, using set and apply:

df1.gems =  df1.gems.str.split(', ')

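df3 = df2.T  # one row per df2 column

def func(row):
    d = {}
    for idx, val in enumerate(df3.values):
        # gems of this row that also occur in the current df2 column
        v = list(set(row) & set(val))
        # here the plain column holds the joined items and the 'Group' column the count
        d[df3.index[idx]] = ', '.join(v)
        d[f"{df3.index[idx]}Group"] = len(v)
    return pd.Series(d)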
res = pd.concat([df1,df1['gems'].apply(func)], axis=1)

A concise solution:

df1.gems =  df1.gems.str.split(', ')
for col in df2.columns:
    z = (zip(df1.gems, [df2[col].values] * len(df1)))
    res = ([', '.join(list(set(a).intersection(b))) for a, b in z])
    df1[col] = res
    df1[f"{col}Group"] = (list(map(lambda x: len(x.split(', ')) if x!='' else 0, res)))

Result:

      values                  gems   1C  1CGroup      02C  02CGroup  34C  34CGroup
0    Cricket  [A1K, A2M, JA3, AN4]  A1K        1                  0              0
1     Soccer     [B1, A1, Bn2, B3]   B1        1  B3, Bn2         2   B3         1
2   Football             [CD1, A1]             0                  0  CD1         1
3     Tennis            [KWS, KQM]             0                  0              0
4  Badminton             [JP, CVK]             0                  0              0
5      Chess              [KF, GF]             0                  0   KF         1
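Because a set has no defined order, the joined items in this answer can come out in any order (B3, Bn2 rather than Bn2, B3), and the items land in the plain column with the count in the Group column, the reverse of the layout asked for in the question. Below is a minimal sketch that keeps the order in which the gems appear and follows the question's layout, assuming the same df1 and df2; the names `out`, `gem_lists`, `wanted` and `hits` are mine.

```python
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

out = df1.copy()                         # leave df1 itself untouched
gem_lists = df1['gems'].str.split(', ')  # one list of gems per row

for col in df2.columns:
    wanted = set(df2[col])               # set used only for fast membership tests
    # Filter each row's list, keeping the order in which the gems appear in df1.
    hits = gem_lists.map(lambda gems: [g for g in gems if g in wanted])
    out[col] = hits.map(len)                   # count column
    out[f'{col}Group'] = hits.map(', '.join)   # matched items, '' when none

print(out)
```

Rows without a match get an empty string in the Group columns; if NaN is preferred there, something like `out[f'{col}Group'].replace('', float('nan'))` could be applied afterwards.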