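I have two pandas DataFrames, df1 and df2. I need to create a new column in df1 by searching df2['B'] to check whether df1['A'] is a substring of df2['B']. If there is a match, the new column df1['B'] should hold the corresponding df2['A'] value.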
Here are the sample DataFrames:
df1
A B
9.female.ceo.,ceo, ?
9.female.ned.,ned,
9.female.ned.,chair,
2.female.ed.,ned,
2.female.ned.,ed,
9.female.chair.,ceo,
2.female.chair.,chair,
df2
A B
,ceo,ned, 2.male.chair.,ceo,ned,
,chair,ned, 2.male.ned.,chair,ned,
,ned, 2.female.ed.,ned,
,ceo,chair, 6.female.ed.,ceo,chair,
,ed,ceo, 6.male.chair.,ed,ceo,
,ceo,chair, 9.female.ed.,ceo,chair,
,ceo,ned, 9.female.chair.,ceo,ned,
,chair,(in ft10), 9.male.ceo.,chair,(in ft10),
Because df1['A'] is only a substring of df2['B'], a plain merge does not work in this case.
Any help pointing in the right direction is greatly appreciated.
Expected result:
df1
A B
9.female.ceo.,ceo,
9.female.ned.,ned,
9.female.ned.,chair,
2.female.ed.,ned, ,ned,
2.female.ned.,ed,
9.female.chair.,ceo, ,ceo,ned,
2.female.chair.,chair,
Answer 0 (score: 1)
The idea is to split each string on commas (',') into a set of tokens and match with issubset:
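# Map each df2['A'] label to the set of comma-separated tokens in df2['B'].
# Note: duplicate df2['A'] labels keep only the last df2['B'] value.
d = {k: set(v.split(',')) for k, v in df2.set_index('A')['B'].items()}
# For each df1['A'], take the first label whose token set contains all of its tokens.
df1['B'] = [next(iter([k for k, v in d.items() if set(x.split(',')).issubset(v)]), '')
            for x in df1['A']]
print(df1)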
A B
0 9.female.ceo.,ceo,
1 9.female.ned.,ned,
2 9.female.ned.,chair,
3 2.female.ed.,ned, ,ned,
4 2.female.ned.,ed,
5 9.female.chair.,ceo, ,ceo,ned,
6 2.female.chair.,chair,
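For anyone who wants to reproduce the token-set idea end to end, here is a minimal self-contained sketch built on the sample data from the question. The helper name first_subset_match and the use of a list of pairs instead of a dict are my own choices for illustration, not part of the answer above; the list is used so that repeated df2['A'] labels are not collapsed.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [
    "9.female.ceo.,ceo,",
    "9.female.ned.,ned,",
    "9.female.ned.,chair,",
    "2.female.ed.,ned,",
    "2.female.ned.,ed,",
    "9.female.chair.,ceo,",
    "2.female.chair.,chair,",
]})
df2 = pd.DataFrame({
    "A": [",ceo,ned,", ",chair,ned,", ",ned,", ",ceo,chair,",
          ",ed,ceo,", ",ceo,chair,", ",ceo,ned,", ",chair,(in ft10),"],
    "B": ["2.male.chair.,ceo,ned,", "2.male.ned.,chair,ned,",
          "2.female.ed.,ned,", "6.female.ed.,ceo,chair,",
          "6.male.chair.,ed,ceo,", "9.female.ed.,ceo,chair,",
          "9.female.chair.,ceo,ned,", "9.male.ceo.,chair,(in ft10),"],
})

# Pair each df2['A'] label with the token set of its df2['B'] string.
# A list of pairs (not a dict) keeps rows whose df2['A'] labels repeat.
pairs = [(label, set(text.split(","))) for label, text in zip(df2["A"], df2["B"])]


def first_subset_match(value, pairs, default=""):
    """Return the first label whose token set contains every token of value."""
    tokens = set(value.split(","))
    return next((label for label, token_set in pairs if tokens <= token_set), default)


df1["B"] = [first_subset_match(x, pairs) for x in df1["A"]]
print(df1)
```

Passing a generator to next with a default avoids building the intermediate list and stops at the first match.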
A solution that tests for a substring with in:
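# Series of df2['B'] values indexed by df2['A'].
d = df2.set_index('A')['B']
# For each df1['A'], take the first label whose df2['B'] string contains it.
df1['B'] = [next(iter([k for k, v in d.items() if x in v]), '') for x in df1['A']]
print(df1)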
A B
0 9.female.ceo.,ceo,
1 9.female.ned.,ned,
2 9.female.ned.,chair,
3 2.female.ed.,ned, ,ned,
4 2.female.ned.,ed,
5 9.female.chair.,ceo, ,ceo,ned,
6 2.female.chair.,chair,
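The two approaches give the same result on the sample data, but they are not equivalent in general. The sketch below illustrates where they can diverge; the helper names and the two input pairs are made up for illustration and do not come from the question's data.

```python
def substring_match(needle, haystack):
    """True when needle appears as a contiguous run of characters."""
    return needle in haystack


def token_subset_match(needle, haystack, sep=","):
    """True when every sep-separated token of needle is a token of haystack."""
    return set(needle.split(sep)).issubset(haystack.split(sep))


# Same tokens in a different order: only the token-set test matches.
reordered = ("ned,2.female.ed.", "2.female.ed.,ned,")
# Needle cuts through a token ('female.ed.' vs '2.female.ed.'):
# only the substring test matches.
partial = ("female.ed.,ned", "2.female.ed.,ned,")

for needle, haystack in (reordered, partial):
    print(needle, "|", substring_match(needle, haystack),
          token_subset_match(needle, haystack))
```

Which behaviour is wanted depends on the data: the substring test is order-sensitive and can match inside a token, while the token-set test ignores order and only matches whole tokens.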
Another solution is a cross join via merge, followed by a substring test with in:
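# Cross join via a constant helper key; overlapping df2 columns get the '_' suffix.
df3 = df1.assign(tmp=1).merge(df2.assign(tmp=1), on='tmp', suffixes=('','_'))
# Keep only pairs where df1['A'] is a substring of df2['B'].
df3 = df3.loc[[a in b for a, b in zip(df3['A'], df3['B_'])], ['A','A_']]
# Attach the matched df2['A'] label back to df1 as column B.
df = df1[['A']].merge(df3.rename(columns={'A_':'B'}), on='A', how='left')
print(df)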
A B
0 9.female.ceo.,ceo, NaN
1 9.female.ned.,ned, NaN
2 9.female.ned.,chair, NaN
3 2.female.ed.,ned, ,ned,
4 2.female.ned.,ed, NaN
5 9.female.chair.,ceo, ,ceo,ned,
6 2.female.chair.,chair, NaN
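If your pandas version is 1.2 or newer, the helper tmp column should not be needed, because merge accepts how='cross'. The following is a sketch of that variant on a reduced, made-up subset of the data; the drop_duplicates and fillna steps are my additions, assuming you want at most one match per row and an empty string instead of NaN.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [
    "2.female.ed.,ned,",
    "9.female.chair.,ceo,",
    "2.female.ned.,ed,",
]})
df2 = pd.DataFrame({
    "A": [",ned,", ",ceo,ned,", ",ed,ceo,"],
    "B": ["2.female.ed.,ned,", "9.female.chair.,ceo,ned,", "6.male.chair.,ed,ceo,"],
})

# Pair every df1 row with every df2 row (requires pandas >= 1.2).
# Only the overlapping column 'A' from df2 receives the '_' suffix.
pairs = df1[["A"]].merge(df2, how="cross", suffixes=("", "_"))

# Keep the pairs where df1['A'] is a substring of df2['B'].
mask = [a in b for a, b in zip(pairs["A"], pairs["B"])]
matches = pairs.loc[mask, ["A", "A_"]].rename(columns={"A_": "B"})

# Keep at most one match per df1 value so the left merge cannot add rows.
matches = matches.drop_duplicates("A")

result = df1[["A"]].merge(matches, on="A", how="left")
result["B"] = result["B"].fillna("")
print(result)
```

Note that a cross join materialises len(df1) * len(df2) rows, so for large frames the list-comprehension solutions may use less memory.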