How to search for a string from one pandas DataFrame column as a substring of another DataFrame's column

Time: 2019-02-24 13:49:04

Tags: python pandas dataframe

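I have two pandas DataFrames, df1 and df2. I need to create a new column in df1 by searching df2['B'] to see whether df1['A'] is a substring of df2['B']. If there is a match, the corresponding df2['A'] value should be returned in the new column df1['B'].

Below are the sample DataFrames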

df1

      A                  B           
9.female.ceo.,ceo,       ?
9.female.ned.,ned,
9.female.ned.,chair,
2.female.ed.,ned,
2.female.ned.,ed,
9.female.chair.,ceo,
2.female.chair.,chair,

df2

     A                B
,ceo,ned,          2.male.chair.,ceo,ned,
,chair,ned,        2.male.ned.,chair,ned,  
,ned,              2.female.ed.,ned,
,ceo,chair,        6.female.ed.,ceo,chair,
,ed,ceo,           6.male.chair.,ed,ceo,
,ceo,chair,        9.female.ed.,ceo,chair,
,ceo,ned,          9.female.chair.,ceo,ned,
,chair,(in ft10),  9.male.ceo.,chair,(in ft10),

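Because df1['A'] holds only substrings of df2['B'] rather than equal values, an ordinary merge on these columns does not work here.

Any help pointing me in the right direction is much appreciated.

Expected result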

df1

      A                    B           
9.female.ceo.,ceo,       
9.female.ned.,ned,
9.female.ned.,chair,
2.female.ed.,ned,         ,ned,
2.female.ned.,ed,
9.female.chair.,ceo,      ,ceo,ned,
2.female.chair.,chair,  

1 answer:

Answer 0: (score: 1)

The idea is to split each string on , to build sets, then match them with issubset:

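# Map each df2['A'] value to the set of comma-separated tokens in its df2['B'] string
d = {k: set(v.split(',')) for k, v in df2.set_index('A')['B'].items()}
# For each df1['A'], take the first key whose token set contains all of its tokens, else ''
df1['B'] = [next(iter([k for k, v in d.items() if set(x.split(',')).issubset(v)]), '')
            for x in df1['A']]
print(df1)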
                        A          B
0      9.female.ceo.,ceo,           
1      9.female.ned.,ned,           
2    9.female.ned.,chair,           
3       2.female.ed.,ned,      ,ned,
4       2.female.ned.,ed,           
5    9.female.chair.,ceo,  ,ceo,ned,
6  2.female.chair.,chair,           
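
One caveat with the dict above: it is keyed by df2['A'], so when a value of A repeats (as ,ceo,ned, and ,ceo,chair, do in the sample) only the last B string for that key is kept. The sketch below is one possible variant, not part of the original answer, that keeps every row by using a list of pairs instead of a dict. The helper name first_subset_match is made up for illustration, and the frames are rebuilt from the sample data so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"A": [
    "9.female.ceo.,ceo,", "9.female.ned.,ned,", "9.female.ned.,chair,",
    "2.female.ed.,ned,", "2.female.ned.,ed,", "9.female.chair.,ceo,",
    "2.female.chair.,chair,",
]})
df2 = pd.DataFrame({
    "A": [",ceo,ned,", ",chair,ned,", ",ned,", ",ceo,chair,",
          ",ed,ceo,", ",ceo,chair,", ",ceo,ned,", ",chair,(in ft10),"],
    "B": ["2.male.chair.,ceo,ned,", "2.male.ned.,chair,ned,",
          "2.female.ed.,ned,", "6.female.ed.,ceo,chair,",
          "6.male.chair.,ed,ceo,", "9.female.ed.,ceo,chair,",
          "9.female.chair.,ceo,ned,", "9.male.ceo.,chair,(in ft10),"],
})

# One (label, token set) pair per df2 row; a list keeps rows whose 'A' repeats.
pairs = [(a, set(b.split(","))) for a, b in zip(df2["A"], df2["B"])]


def first_subset_match(text, default=""):
    """Hypothetical helper: label of the first df2 row whose tokens cover text's tokens."""
    tokens = set(text.split(","))
    # The generator stops at the first hit instead of building a full list.
    return next((a for a, row_tokens in pairs if tokens.issubset(row_tokens)), default)


df1["B"] = [first_subset_match(x) for x in df1["A"]]
print(df1)
```

Keep in mind that a subset test ignores token order and adjacency, so it is looser than a true substring check.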

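A solution using the in substring test:

# Series of df2['B'] strings indexed by df2['A']; items() yields every row, duplicates included
d = df2.set_index('A')['B']
# For each df1['A'], take the first df2['A'] whose B string contains it, else ''
df1['B'] = [next(iter([k for k, v in d.items() if x in v]), '') for x in df1['A']]
print(df1)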
                        A          B
0      9.female.ceo.,ceo,           
1      9.female.ned.,ned,           
2    9.female.ned.,chair,           
3       2.female.ed.,ned,      ,ned,
4       2.female.ned.,ed,           
5    9.female.chair.,ceo,  ,ceo,ned,
6  2.female.chair.,chair,           
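
The same substring test can also be written with pandas string methods. The following is a minimal sketch rather than the answerer's code: it uses Series.str.contains with regex=False so that characters such as . and ( in the data are matched literally, and the helper name lookup is hypothetical. The sample frames are rebuilt so the snippet is self-contained.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"A": [
    "9.female.ceo.,ceo,", "9.female.ned.,ned,", "9.female.ned.,chair,",
    "2.female.ed.,ned,", "2.female.ned.,ed,", "9.female.chair.,ceo,",
    "2.female.chair.,chair,",
]})
df2 = pd.DataFrame({
    "A": [",ceo,ned,", ",chair,ned,", ",ned,", ",ceo,chair,",
          ",ed,ceo,", ",ceo,chair,", ",ceo,ned,", ",chair,(in ft10),"],
    "B": ["2.male.chair.,ceo,ned,", "2.male.ned.,chair,ned,",
          "2.female.ed.,ned,", "6.female.ed.,ceo,chair,",
          "6.male.chair.,ed,ceo,", "9.female.ed.,ceo,chair,",
          "9.female.chair.,ceo,ned,", "9.male.ceo.,chair,(in ft10),"],
})


def lookup(text, default=""):
    """Hypothetical helper: df2['A'] of the first row whose 'B' contains text literally."""
    # regex=False treats text as a plain string, not as a regular expression.
    hits = df2.loc[df2["B"].str.contains(text, regex=False), "A"]
    return hits.iloc[0] if not hits.empty else default


df1["B"] = df1["A"].map(lookup)
print(df1)
```

This still scans df2 once per row of df1, so it is a readability choice rather than a speed-up.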

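Another solution is a cross join with merge, followed by a substring test with in:

# Cross join via a constant helper key; df2's overlapping columns get the '_' suffix
df3 = df1.assign(tmp=1).merge(df2.assign(tmp=1), on='tmp', suffixes=('','_'))
# Keep only the pairs where df1's A is a substring of df2's B
df3 = df3.loc[[a in b for a, b in zip(df3['A'], df3['B_'])], ['A','A_']]

# Left join back so that unmatched df1 rows stay, with NaN in B
df = df1[['A']].merge(df3.rename(columns={'A_':'B'}), on='A', how='left')
print(df)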
                        A          B
0      9.female.ceo.,ceo,        NaN
1      9.female.ned.,ned,        NaN
2    9.female.ned.,chair,        NaN
3       2.female.ed.,ned,      ,ned,
4       2.female.ned.,ed,        NaN
5    9.female.chair.,ceo,  ,ceo,ned,
6  2.female.chair.,chair,        NaN
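
On pandas 1.2 or later, the helper tmp column is not needed because merge accepts how='cross'. The sketch below assumes such a version and that df1 starts with only column A. The suffix _df2 and the drop_duplicates step (which keeps just the first match per df1 value) are choices made here for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"A": [
    "9.female.ceo.,ceo,", "9.female.ned.,ned,", "9.female.ned.,chair,",
    "2.female.ed.,ned,", "2.female.ned.,ed,", "9.female.chair.,ceo,",
    "2.female.chair.,chair,",
]})
df2 = pd.DataFrame({
    "A": [",ceo,ned,", ",chair,ned,", ",ned,", ",ceo,chair,",
          ",ed,ceo,", ",ceo,chair,", ",ceo,ned,", ",chair,(in ft10),"],
    "B": ["2.male.chair.,ceo,ned,", "2.male.ned.,chair,ned,",
          "2.female.ed.,ned,", "6.female.ed.,ceo,chair,",
          "6.male.chair.,ed,ceo,", "9.female.ed.,ceo,chair,",
          "9.female.chair.,ceo,ned,", "9.male.ceo.,chair,(in ft10),"],
})

# Every df1 row paired with every df2 row; df2's 'A' becomes 'A_df2'.
crossed = df1[["A"]].merge(df2, how="cross", suffixes=("", "_df2"))

# Keep pairs where df1's A occurs inside df2's B.
mask = [a in b for a, b in zip(crossed["A"], crossed["B"])]
matches = (
    crossed.loc[mask, ["A", "A_df2"]]
    .rename(columns={"A_df2": "B"})
    .drop_duplicates("A")  # assumption: only the first match per df1 value is wanted
)

# Left join so unmatched df1 rows are kept with NaN in B.
result = df1[["A"]].merge(matches, on="A", how="left")
print(result)
```

The cross join materialises len(df1) * len(df2) rows, so this approach suits small frames best.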