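I want to create a new column in my DataFrame that holds whichever value from a column of a second DataFrame is contained in an existing column.
First DataFrame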
WXYnineZAB
EFGsixHIJ
QRSeightTUV
GHItwoJKL
YZAfiveBCD
EFGsixHIJ
MNOthreePQR
ABConeDEF
MNOthreePQR
MNOthreePQR
YZAfiveBCD
WXYnineZAB
GHItwoJKL
KLMsevenNOP
EFGsixHIJ
ABConeDEF
KLMsevenNOP
QRSeightTUV
STUfourVWX
STUfourVWX
KLMsevenNOP
WXYnineZAB
CDEtenFGH
YZAfiveBCD
CDEtenFGH
QRSeightTUV
ABConeDEF
STUfourVWX
CDEtenFGH
GHItwoJKL
Second DataFrame
one
three
five
seven
nine
Output DataFrame
WXYnineZAB,nine
EFGsixHIJ,***
QRSeightTUV,***
GHItwoJKL,***
YZAfiveBCD,five
EFGsixHIJ,***
MNOthreePQR,three
ABConeDEF,one
MNOthreePQR,three
MNOthreePQR,three
YZAfiveBCD,five
WXYnineZAB,nine
GHItwoJKL,***
KLMsevenNOP,seven
EFGsixHIJ,***
ABConeDEF,one
KLMsevenNOP,seven
QRSeightTUV,***
STUfourVWX,***
STUfourVWX,***
KLMsevenNOP,seven
WXYnineZAB,nine
CDEtenFGH,***
YZAfiveBCD,five
CDEtenFGH,***
QRSeightTUV,***
ABConeDEF,one
STUfourVWX,***
CDEtenFGH,***
GHItwoJKL,***
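For ease of explanation, I built the first DataFrame as 3 chars + search string + 3 chars, but my actual file has no such consistency.
Answer 0 (score: 0)
Source DataFrames: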
In [172]: d1
Out[172]:
txt
0 WXYnineZAB
1 EFGsixHIJ
2 QRSeightTUV
3 GHItwoJKL
4 YZAfiveBCD
.. ...
25 QRSeightTUV
26 ABConeDEF
27 STUfourVWX
28 CDEtenFGH
29 GHItwoJKL
[30 rows x 1 columns]
In [173]: d2
Out[173]:
word
0 one
1 three
2 five
3 seven
4 nine
Generate a RegEx pattern from the second DataFrame:
In [174]: pat = r'({})'.format(d2['word'].str.cat(sep='|'))
In [175]: pat
Out[175]: '(one|three|five|seven|nine)'
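If the search words might contain RegEx metacharacters (for example `.` or `+`), the pattern can be built more defensively. The following is only a sketch: escaping with `re.escape` and sorting the words longest-first are precautions assumed here, not part of the answer above.

```python
import re
import pandas as pd

d2 = pd.DataFrame({'word': ['one', 'three', 'five', 'seven', 'nine']})

# Longest words first: regex alternation takes the first alternative that
# matches at a position, so a short word could otherwise shadow a longer one.
words = sorted(d2['word'].astype(str), key=len, reverse=True)

# re.escape makes every word match literally, whatever characters it contains.
pat = '({})'.format('|'.join(re.escape(w) for w in words))
print(pat)
```

For the sample words this produces the same alternatives as above, just in a different order.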
Extract the word that matches the RegEx pattern and assign it as a new column:
In [176]: d1['new'] = d1['txt'].str.extract(pat, expand=False)
In [177]: d1
Out[177]:
txt new
0 WXYnineZAB nine
1 EFGsixHIJ NaN
2 QRSeightTUV NaN
3 GHItwoJKL NaN
4 YZAfiveBCD five
.. ... ...
25 QRSeightTUV NaN
26 ABConeDEF one
27 STUfourVWX NaN
28 CDEtenFGH NaN
29 GHItwoJKL NaN
[30 rows x 2 columns]
If you prefer, you can also fill the NaNs in the same step:
In [178]: d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')
In [179]: d1
Out[179]:
txt new
0 WXYnineZAB nine
1 EFGsixHIJ ***
2 QRSeightTUV ***
3 GHItwoJKL ***
4 YZAfiveBCD five
.. ... ...
25 QRSeightTUV ***
26 ABConeDEF one
27 STUfourVWX ***
28 CDEtenFGH ***
29 GHItwoJKL ***
[30 rows x 2 columns]
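For anyone who wants to run this outside IPython, the steps above can be put together as a small standalone script. This is a minimal sketch: the shortened `d1` and its irregular rows (`'xxfivexxxxxx'`, `'seven'`) are made-up sample data, added to show that the match does not depend on the 3 chars + word + 3 chars layout.

```python
import pandas as pd

# Hypothetical sample data; rows 2 and 3 deliberately break the 3+word+3 layout.
d1 = pd.DataFrame({'txt': ['WXYnineZAB', 'EFGsixHIJ', 'xxfivexxxxxx', 'seven', 'GHItwoJKL']})
d2 = pd.DataFrame({'word': ['one', 'three', 'five', 'seven', 'nine']})

# Build the alternation pattern with one capture group.
pat = '({})'.format('|'.join(d2['word']))

# str.extract returns the first match per row (NaN if none); fill the gaps with '***'.
d1['new'] = d1['txt'].str.extract(pat, expand=False).fillna('***')
print(d1)
```

Note that `str.extract` keeps only the first match in each row, so a row containing several of the words gets just one of them.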
Answer 1 (score: 0)
If you want to avoid RegEx, here is a purely list-based solution:
import pandas as pd

# Sample DataFrames (structure borrowed from MaxU's answer)
d1 = pd.DataFrame({'txt': ['WXYnineZAB', 'EFGsixHIJ', 'QRSeightTUV', 'GHItwoJKL']})
d2 = pd.DataFrame({'word': ['two', 'six']})
# For each txt, keep the matching word only if exactly one word from d2 occurs in it; otherwise use '***' (1-liner).
exists = [list(d2.word[[word in txt for word in d2.word]])[0] if sum(word in txt for word in d2.word) == 1 else '***' for txt in d1.txt]
# Resulting output (list() keeps this working on pandas versions that do not accept a zip iterator)
res = pd.DataFrame(list(zip(d1.txt, exists)), columns=['text', 'word'])
Result:
text word
0 WXYnineZAB ***
1 EFGsixHIJ six
2 QRSeightTUV ***
3 GHItwoJKL two
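The one-liner is hard to read, so here is a possible alternative written with a small helper. It is a sketch, not a drop-in replacement: the helper name `first_match` is made up for illustration, and unlike the one-liner (which yields `'***'` unless exactly one word matches) it returns the first word found when several match.

```python
import pandas as pd

d1 = pd.DataFrame({'txt': ['WXYnineZAB', 'EFGsixHIJ', 'QRSeightTUV', 'GHItwoJKL']})
d2 = pd.DataFrame({'word': ['two', 'six']})

def first_match(txt, words, default='***'):
    """Return the first word that is a substring of txt, or default if none is."""
    return next((w for w in words if w in txt), default)

words = d2['word'].tolist()
# Plain substring test per row; no RegEx involved.
d1['word'] = d1['txt'].map(lambda t: first_match(t, words))
print(d1)
```

Because `next` stops at the first hit, the order of `d2['word']` decides which word wins when a row contains more than one.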