I have two DataFrames, and I am trying to find out whether a partial string (the file name) from DataFrame g occurs inside the full string (the full file path) in DataFrame d, and then update a match column in the original DataFrame g.
g = pd.DataFrame([['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame([['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\zthatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])
For example, I essentially want to look up g.FILENAME in d.FULL_FILE. My first attempt was, I think, a mistake, because it looks for exact matches.
I then tried the following, but it seems to match everything instead of finding g.FILENAME inside d.FULL_FILE:
g = g.merge(d, left_on=g.FILENAME.str.extract(r'(\d+)', expand=False), right_on=d.FULL_FILE.str.extract(r'(\d+)', expand=False), how='inner')
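A possible explanation for the "matches everything" result, sketched under the assumption that the sample data is representative: the pattern `(\d+)` captures digits, and none of the sample names contain any, so both key Series come out entirely null. As far as I know, pandas pairs null merge keys with each other (unlike SQL), so every row of g is joined to every row of d. The names `left_key`, `right_key` and `merged` are only illustrative.

```python
import pandas as pd

g = pd.DataFrame(
    [['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ['c:\\ythatFile.exe', 'thatFile.exe'],
     ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
    columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame(
    [['c:\\zthis_File.pdf', 'this_File.pdf'],
     ['c:\\zthatFile.exe', 'thatFile.exe'],
     ['c:\\ztheOtherFile.zip', 'ssss.zip']],
    columns=['FULL_FILE', 'FILENAME'])

# r'(\d+)' captures a run of digits; rows without digits yield a null key.
left_key = g.FILENAME.str.extract(r'(\d+)', expand=False)
right_key = d.FULL_FILE.str.extract(r'(\d+)', expand=False)
print(left_key.isna().all(), right_key.isna().all())

# Merging on two all-null key Series: compare the row count with len(g) * len(d).
merged = g.merge(d, left_on=left_key, right_on=right_key, how='inner')
print(len(merged), len(g) * len(d))
```

If the two printed counts agree, the merge is a cross product rather than a substring match, which would explain the behaviour described above.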
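The overall goal is:
1. Match g.FULL_FILE against the d.FULL_FILE column.
2. If there is no match, match on g.FILENAME when it occurs as a partial string anywhere in the d.FULL_FILE column.
3. If there is still no match, check whether the last 10 characters of g.FILENAME occur in the d.FULL_FILE column (in case g.FULL_FILE contains special characters).
Please help. I have been researching this for hours and have found solutions to similar problems, but none that quite solve this one, and I am struggling to adapt them.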
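Answer 0 (score: 1)
I can't fully tell what result you are after, but here is my best shot at the first two steps you mention. (I don't understand your comment about checking the last 10 characters or how it relates to special characters, so I skipped that part.)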
In [80]: g
Out[80]:
              FULL_FILE          FILENAME
0      c:\ythisFile.pdf      thisFile.pdf
1      c:\ythatFile.exe      thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip
In [81]: d
Out[81]:
              FULL_FILE       FILENAME
0     c:\zthis_File.pdf  this_File.pdf
1      c:\zthatFile.exe   thatFile.exe
2  c:\ztheOtherFile.zip       ssss.zip
In [82]: temp1 = pd.merge(
    ...:     g,
    ...:     d,
    ...:     on='FULL_FILE',
    ...:     how='left',
    ...:     suffixes=('_g', '_d')
    ...: )
In [83]: temp1
Out[83]:
              FULL_FILE        FILENAME_g  FILENAME_d
0      c:\ythisFile.pdf      thisFile.pdf         NaN
1      c:\ythatFile.exe      thatFile.exe         NaN
2  c:\ytheOtherFile.zip  theOtherFile.zip         NaN
In [84]: step2 = d.FULL_FILE.map(
    ...:     lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
    ...:         lambda y: y in x
    ...:     ).any()
    ...: )
In [85]: step2
Out[85]:
0    False
1     True
2     True
Name: FULL_FILE, dtype: bool
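The nested lambdas in step 2 can be hard to read, so here is a minimal restatement of what that step computes: for each path in d.FULL_FILE, is any still-unmatched name from g a plain substring of it? The names `unmatched` and `d_full` are stand-ins introduced only for this sketch, filled with the sample values from above.

```python
import pandas as pd

# Names from g that step 1 left unmatched (all three, in the sample data).
unmatched = ['thisFile.pdf', 'thatFile.exe', 'theOtherFile.zip']
d_full = pd.Series(
    ['c:\\zthis_File.pdf', 'c:\\zthatFile.exe', 'c:\\ztheOtherFile.zip'],
    name='FULL_FILE')

# True where at least one unmatched name occurs inside the path.
# `in` is a literal substring test, so '.' and '_' are not treated as regex.
step2 = d_full.map(lambda path: any(name in path for name in unmatched))
print(step2.tolist())  # [False, True, True]
```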
In [86]: temp2 = pd.merge(
    ...:     temp1,
    ...:     d.loc[step2].drop('FULL_FILE', axis=1),
    ...:     left_index=True,
    ...:     right_index=True,
    ...:     how='left'
    ...: )
In [87]: temp2
Out[87]:
              FULL_FILE        FILENAME_g  FILENAME_d      FILENAME
0      c:\ythisFile.pdf      thisFile.pdf         NaN           NaN
1      c:\ythatFile.exe      thatFile.exe         NaN  thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip         NaN      ssss.zip
In [88]: temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
In [89]: temp2.drop('FILENAME', axis=1)
Out[89]:
              FULL_FILE        FILENAME_g    FILENAME_d
0      c:\ythisFile.pdf      thisFile.pdf           NaN
1      c:\ythatFile.exe      thatFile.exe  thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip      ssss.zip
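The third step from the question (the last-10-characters check) is skipped above. One possible reading of it, offered only as a sketch: take the last 10 characters of each g.FILENAME and look for them as a substring of any d.FULL_FILE. The helper `match_by_suffix`, its parameter `n`, and the column `SUFFIX_MATCH` are hypothetical names, and the interpretation itself is an assumption about what the asker meant.

```python
import pandas as pd

g = pd.DataFrame(
    [['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ['c:\\ythatFile.exe', 'thatFile.exe'],
     ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
    columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame(
    [['c:\\zthis_File.pdf', 'this_File.pdf'],
     ['c:\\zthatFile.exe', 'thatFile.exe'],
     ['c:\\ztheOtherFile.zip', 'ssss.zip']],
    columns=['FULL_FILE', 'FILENAME'])


def match_by_suffix(name, candidates, n=10):
    """Return the first candidate containing the last n characters of name, else None."""
    tail = name[-n:]
    for path in candidates:
        if tail in path:  # literal substring test, no regex
            return path
    return None


# Record, for every row of g, which d.FULL_FILE (if any) contains the name's tail.
g['SUFFIX_MATCH'] = g.FILENAME.map(lambda name: match_by_suffix(name, d.FULL_FILE))
print(g[['FILENAME', 'SUFFIX_MATCH']])
```

With the sample data this finds the same two rows as step 2 and still leaves thisFile.pdf unmatched, because its tail 'isFile.pdf' does not occur in c:\zthis_File.pdf.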
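Note that this also works when step 1 actually finds a match. For example, after wrapping the steps in a function and adding such an entry to your sample data: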
In [135]: def fuzzy_match(g, d):
     ...:     temp1 = pd.merge(
     ...:         g,
     ...:         d,
     ...:         on='FULL_FILE',
     ...:         how='left',
     ...:         suffixes=('_g', '_d')
     ...:     )
     ...:     step2 = d.FULL_FILE.map(
     ...:         lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
     ...:             lambda y: y in x
     ...:         ).any()
     ...:     )
     ...:     temp2 = pd.merge(
     ...:         temp1,
     ...:         d.loc[step2].drop('FULL_FILE', axis=1),
     ...:         left_index=True,
     ...:         right_index=True,
     ...:         how='left'
     ...:     )
     ...:     temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
     ...:     return temp2.drop('FILENAME', axis=1)
     ...:
In [136]: g = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME'])
     ...: d = pd.DataFrame([['c:\\aFile.txt', 'aFile.txt'], ['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\zthatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])
In [137]: g
Out[137]:
              FULL_FILE          FILENAME
0          c:\aFile.txt         aFile.txt
1      c:\ythisFile.pdf      thisFile.pdf
2      c:\ythatFile.exe      thatFile.exe
3  c:\ytheOtherFile.zip  theOtherFile.zip
In [138]: d
Out[138]:
              FULL_FILE       FILENAME
0          c:\aFile.txt      aFile.txt
1     c:\zthis_File.pdf  this_File.pdf
2      c:\zthatFile.exe   thatFile.exe
3  c:\ztheOtherFile.zip       ssss.zip
In [139]: fuzzy_match(g, d)
Out[139]:
              FULL_FILE        FILENAME_g    FILENAME_d
0          c:\aFile.txt         aFile.txt     aFile.txt
1      c:\ythisFile.pdf      thisFile.pdf           NaN
2      c:\ythatFile.exe      thatFile.exe  thatFile.exe
3  c:\ytheOtherFile.zip  theOtherFile.zip      ssss.zip
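One caveat, as far as I can tell from the code: fuzzy_match attaches rows of d to rows of g by index position (left_index=True, right_index=True), so it gives the output above because matching files sit at the same position in both frames. If the rows can be in a different order, a per-row lookup may be safer. The sketch below is an alternative, not the method above; `match_files` and the column `MATCHED_FULL_FILE` are hypothetical names, and it returns the matching d.FULL_FILE rather than d.FILENAME.

```python
import pandas as pd


def match_files(g, d):
    """For each row of g, find a d.FULL_FILE by exact path, then by substring."""
    d_paths = d.FULL_FILE.tolist()
    exact_paths = set(d_paths)

    def lookup(row):
        # Step 1: exact match on the full path.
        if row.FULL_FILE in exact_paths:
            return row.FULL_FILE
        # Step 2: g.FILENAME as a literal substring of some d.FULL_FILE.
        for path in d_paths:
            if row.FILENAME in path:
                return path
        return None

    out = g.copy()
    out['MATCHED_FULL_FILE'] = out.apply(lookup, axis=1)
    return out


g = pd.DataFrame(
    [['c:\\aFile.txt', 'aFile.txt'],
     ['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ['c:\\ythatFile.exe', 'thatFile.exe'],
     ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
    columns=['FULL_FILE', 'FILENAME'])
# Same rows as the d above, deliberately listed in reverse order.
d = pd.DataFrame(
    [['c:\\ztheOtherFile.zip', 'ssss.zip'],
     ['c:\\zthatFile.exe', 'thatFile.exe'],
     ['c:\\zthis_File.pdf', 'this_File.pdf'],
     ['c:\\aFile.txt', 'aFile.txt']],
    columns=['FULL_FILE', 'FILENAME'])

result = match_files(g, d)
print(result)
```

This loops over d for every row of g, so it is quadratic in the worst case; that seems acceptable for small frames but is an assumption worth checking against the real data size.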