I have two DataFrames:
import pandas as pd

data = {
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1', 'KWS, KQM', 'JP, CVK', 'KF, GF']
}
df1 = pd.DataFrame(data)
df1
values gems
0 Cricket A1K, A2M, JA3, AN4
1 Soccer B1, A1, Bn2, B3
2 Football CD1, A1
3 Tennis KWS, KQM
4 Badminton JP, CVK
5 Chess KF, GF
The second DataFrame:
data2 = {
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU']
}
df2 = pd.DataFrame(data2)
df2
1C 02C 34C
0 B1 Bn2 KF
1 K1 B3 CD1
2 A1K JK B3
3 J1 ZZ ji
4 A4 ko HU
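For each column of df2, I want to check which of its items appear in df1['gems'], and report both their count and the overlapping items. The expected output is: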
values gems 1C 1CGroup 02C 02CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 1 A1K 0 NA 0 NA
1 Soccer B1, A1, Bn2, B3 1 B1 2 Bn2, B3 1 B3
2 Football CD1, A1 0 NA 0 NA 1 CD1
3 Tennis KWS, KQM 0 NA 0 NA 0 NA
4 Badminton JP, CVK 0 NA 0 NA 0 NA
5 Chess KF, GF 0 NA 0 NA 1 KF
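Answer 0 (score: 7)
First, str.split and explode the gems column, then reset_index so the original index is kept as a column. Then, for each column of df2, merge it with the exploded gems, groupby the original index, and aggregate as needed: count for the number of matches and ', '.join for the matched items. Use pd.concat to combine the per-column results and join them to the original df1. Finally, fillna the count columns with 0, as in the expected output.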
# one row per gem used in the merge
df_ = df1['gems'].str.split(', ').explode().reset_index()
res = (
    df1.join(  # can join to df1 as we keep the original index value
        pd.concat([df_.merge(df2[[col]], left_on='gems', right_on=col)
                      .groupby('index')  # original index in df1
                      [col].agg(**{col: 'count',  # do each aggregation
                                   f'{col}Group': lambda x: ', '.join(x)})
                   for col in df2.columns],  # do it for each column of df2
                  axis=1))
    .fillna({col: 0 for col in df2.columns})  # fill the count columns with 0
)
print(res)
values gems 1C 1CGroup 02C 02CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 1.0 A1K 0.0 NaN 0.0 NaN
1 Soccer B1, A1, Bn2, B3 1.0 B1 2.0 Bn2, B3 1.0 B3
2 Football CD1, A1 0.0 NaN 0.0 NaN 1.0 CD1
3 Tennis KWS, KQM 0.0 NaN 0.0 NaN 0.0 NaN
4 Badminton JP, CVK 0.0 NaN 0.0 NaN 0.0 NaN
5 Chess KF, GF 0.0 NaN 0.0 NaN 1.0 KF
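The counts above print as floats (1.0, 0.0) because the join introduces NaN before fillna runs. The following is a minimal sketch of the same merge/groupby idea with an explicit loop and a final cast to integers; the names `exploded`, `parts` and `count_cols` are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

# One row per gem; the original df1 index is kept in the 'index' column.
exploded = df1['gems'].str.split(', ').explode().reset_index()

parts = []
for col in df2.columns:
    # Inner merge keeps only the gems that also occur in this df2 column.
    matched = exploded.merge(df2[[col]], left_on='gems', right_on=col)
    parts.append(
        matched.groupby('index')[col]
               .agg(**{col: 'count', f'{col}Group': ', '.join})
    )

res = df1.join(pd.concat(parts, axis=1))

# Fill the missing counts with 0 and cast them back to integers.
count_cols = {c: 0 for c in df2.columns}
res = res.fillna(count_cols).astype({c: int for c in df2.columns})
print(res)
```

The Group columns are left as NaN where nothing matched, as in the answer's output; only the count columns are filled and cast.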
Answer 1 (score: 5)
First, build a table of the groups:
df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
                )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=list)
       )
Output:
level_0 02C 1C 34C
index
0 NaN [A1K] NaN
1 [Bn2, B3] [B1] [B3]
2 NaN NaN [CD1]
5 NaN NaN [KF]
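Then compute the counts and concatenate everything with the original table:
# On pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
pd.concat([df1,
           pd.concat([df3.add_suffix('Group').applymap(lambda x: ', '.join(x) if isinstance(x, list) else x),
                      df3.fillna('').applymap(len)],
                     axis=1).sort_index(axis=1)
           ], axis=1)
Output: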
values gems 02C 02CGroup 1C 1CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 0.0 NaN 1.0 A1K 0.0 NaN
1 Soccer B1, A1, Bn2, B3 2.0 Bn2, B3 1.0 B1 1.0 B3
2 Football CD1, A1 0.0 NaN 0.0 NaN 1.0 CD1
3 Tennis KWS, KQM NaN NaN NaN NaN NaN NaN
4 Badminton JP, CVK NaN NaN NaN NaN NaN NaN
5 Chess KF, GF 0.0 NaN 0.0 NaN 1.0 KF
Edit: an alternative that joins the strings directly and derives the counts from them
df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
                )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=', '.join)
       )
pd.concat([df1,
           pd.concat([df3.add_suffix('Group'),
                      df3.applymap(lambda x: x.count(',') + 1 if isinstance(x, str) else 0)],
                     axis=1).sort_index(axis=1)
           ], axis=1)
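In the output shown earlier in this answer, rows 3 and 4 come out as NaN in every new column because those rows never appear in the pivoted table. The sketch below is one possible way to get 0 counts for them instead. Note that it swaps in `melt` for `unstack().reset_index(level=0)` and `groupby(...).unstack()` for `pivot_table`, and that the names `long_df2`, `col`, `gem`, `groups`, `counts` and `out` are illustrative choices, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

# One row per gem, with the original df1 index in the 'index' column.
exploded = df1['gems'].str.split(r',\s+').explode().reset_index()

# Long form of df2: one row per (df2 column name, gem) pair.
long_df2 = df2.melt(var_name='col', value_name='gem')

merged = exploded.merge(long_df2, left_on='gems', right_on='gem')
grouped = merged.groupby(['index', 'col'])['gems']

# Reindexing to df1.index brings back the rows that had no match at all.
groups = (grouped.agg(', '.join).unstack()
                 .reindex(index=df1.index, columns=df2.columns))
counts = (grouped.size().unstack(fill_value=0)
                 .reindex(index=df1.index, columns=df2.columns, fill_value=0))

out = df1.copy()
for c in df2.columns:
    out[c] = counts[c]
    out[f'{c}Group'] = groups[c]
print(out)
```

Building the columns in a loop also keeps them in the order of df2.columns rather than the alphabetical order produced by sort_index(axis=1).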
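Answer 2 (score: 5)
A solution with findall: for each column in df2, find all occurrences of that column's values in the gems column of df1, then use map with len to count the occurrences and, optionally, str.join to join the matches.
for c in df2.columns:
    s = df1['gems'].str.findall('|'.join(df2[c]))
    df1[c] = s.map(len)
    df1[c + 'group'] = s.str.join(', ')
print(df1)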
values gems 1C 1Cgroup 02C 02Cgroup 34C 34Cgroup
0 Cricket A1K, A2M, JA3, AN4 1 A1K 0 0
1 Soccer B1, A1, Bn2, B3 1 B1 2 Bn2, B3 1 B3
2 Football CD1, A1 0 0 1 CD1
3 Tennis KWS, KQM 0 0 0
4 Badminton JP, CVK 0 0 0
5 Chess KF, GF 0 0 1 KF
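One caveat with a bare '|'.join pattern is that it matches substrings: a gem such as 'AB12' would count as a hit for 'B1'. The sketch below is a possible hardening, assuming gems are made of word characters only; `build_pattern` is a hypothetical helper that wraps the alternatives in word boundaries and escapes any regex metacharacters.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})


def build_pattern(values):
    """Hypothetical helper: match any of `values` as a whole token only."""
    alternatives = '|'.join(re.escape(v) for v in values)
    return r'\b(?:' + alternatives + r')\b'


out = df1.copy()
for c in df2.columns:
    found = out['gems'].str.findall(build_pattern(df2[c]))
    out[c] = found.map(len)                # number of matches per row
    out[c + 'group'] = found.str.join(', ')  # '' when nothing matched
print(out)

# The difference shows up on a made-up string containing 'AB12':
plain = re.findall('|'.join(df2['1C']), 'AB12, B1')
strict = re.findall(build_pattern(df2['1C']), 'AB12, B1')
print(plain, strict)
```

On the data from the question both patterns give the same counts; the word boundaries only matter when one gem is a substring of another.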
Answer 3 (score: 3)
The least elegant solution, using set and apply:
df1.gems = df1.gems.str.split(', ')
df3 = df2.T  # one row per df2 column

def func(row):
    d = {}
    for idx, val in enumerate(df3.values):
        v = list(set(row) & set(val))  # gems shared by this row and this df2 column
        # note: here the plain column holds the joined items and the 'Group' column the count
        d[df3.index[idx]] = ', '.join(v)
        d[f"{df3.index[idx]}Group"] = len(v)
    return pd.Series(d)

res = pd.concat([df1, df1['gems'].apply(func)], axis=1)
A more concise solution:
df1.gems = df1.gems.str.split(', ')
for col in df2.columns:
    z = zip(df1.gems, [df2[col].values] * len(df1))
    res = [', '.join(set(a).intersection(b)) for a, b in z]
    df1[col] = res
    df1[f"{col}Group"] = [len(x.split(', ')) if x != '' else 0 for x in res]
Result:
| | values | gems | 1C | 1CGroup | 02C | 02CGroup | 34C | 34CGroup |
|---|---|---|---|---|---|---|---|---|
| 0 | Cricket | [A1K, A2M, JA3, AN4] | A1K | 1 | | 0 | | 0 |
| 1 | Soccer | [B1, A1, Bn2, B3] | B1 | 1 | B3, Bn2 | 2 | B3 | 1 |
| 2 | Football | [CD1, A1] | | 0 | | 0 | CD1 | 1 |
| 3 | Tennis | [KWS, KQM] | | 0 | | 0 | | 0 |
| 4 | Badminton | [JP, CVK] | | 0 | | 0 | | 0 |
| 5 | Chess | [KF, GF] | | 0 | | 0 | KF | 1 |
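Because Python sets are unordered, the joined items in the set-based approach can come out in any order (note 'B3, Bn2' in the result), and the items and counts land in the opposite columns from the question's expected output. The following is a minimal sketch of a variant that keeps each row's original gem order and puts the count in the plain column; `wanted`, `hits` and `out` are illustrative names, and df1 is copied rather than modified in place.

```python
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

out = df1.copy()
gem_lists = out['gems'].str.split(', ')  # one list of gems per row

for col in df2.columns:
    wanted = set(df2[col])  # set membership tests are O(1)
    # Filter each row's list so matches keep the order they had in 'gems'.
    hits = [[g for g in gems if g in wanted] for gems in gem_lists]
    out[col] = [len(h) for h in hits]
    out[f'{col}Group'] = [', '.join(h) for h in hits]
print(out)
```

Unmatched rows get an empty string in the Group column here; replace it with NaN if the question's 'NA' display is needed.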