Match a string from one DataFrame that is a partial string in a second DataFrame

Time: 2018-08-16 10:36:07

Tags: python pandas dataframe match partial

I have two DataFrames, and I am trying to find out whether a partial string (a file name) in DataFrame g occurs inside a full string (a full file path) in DataFrame d, and then update a match column in the original DataFrame g.

For example, I essentially want to look up g.FILENAME in d.FULL_FILE.

Sample data:

g = pd.DataFrame([['c:\\ythisFile.pdf', 'thisFile.pdf'], ['c:\\ythatFile.exe', 'thatFile.exe'], ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']], columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame([['c:\\zthis_File.pdf', 'this_File.pdf'], ['c:\\zthatFile.exe', 'thatFile.exe'], ['c:\\ztheOtherFile.zip', 'ssss.zip']], columns=['FULL_FILE', 'FILENAME'])

My first attempt was wrong, I think, because it only looks for exact matches.

I then tried the following, but it looks like it matches every g.FILENAME against every d.FULL_FILE:

g = g.merge(d, left_on=g.FILENAME.str.extract('(\d+)', expand=False), right_on=d.FULL_FILE.str.extract('(\d+)', expand=False), how='inner')
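One possible explanation for the over-matching, offered as a sketch rather than a diagnosis of the exact call above: none of the sample names contains a digit, so `str.extract(r'(\d+)')` returns a missing value for every row, and `merge` matches missing keys against each other. The column name `key` below is a hypothetical helper, not something from the question.

```python
import pandas as pd

g = pd.DataFrame(
    [['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ['c:\\ythatFile.exe', 'thatFile.exe'],
     ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
    columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame(
    [['c:\\zthis_File.pdf', 'this_File.pdf'],
     ['c:\\zthatFile.exe', 'thatFile.exe'],
     ['c:\\ztheOtherFile.zip', 'ssss.zip']],
    columns=['FULL_FILE', 'FILENAME'])

# Hypothetical helper column: the first run of digits found in each string.
g_keys = g.assign(key=g.FILENAME.str.extract(r'(\d+)', expand=False))
d_keys = d.assign(key=d.FULL_FILE.str.extract(r'(\d+)', expand=False))

# No name contains a digit, so every key is missing.
print(g_keys.key.isna().all(), d_keys.key.isna().all())

# Missing keys are matched against each other, so each row of g
# pairs with every row of d instead of with one specific file.
merged = g_keys.merge(d_keys, on='key', how='inner', suffixes=('_g', '_d'))
print(len(merged))
```

If that is what happened, the regex key carries no information about the file at all, which is why a substring test on the names themselves is a better fit here.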

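The overall goal is:
1. Match g.FULL_FILE against the d.FULL_FILE column.
2. If there is no match, match g.FILENAME if it occurs as a partial string in the d.FULL_FILE column.
3. If there is still no match, check whether the last 10 characters of g.FILENAME match in the d.FULL_FILE column (in case there are special characters in g.FULL_FILE).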
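The three steps above can be sketched in plain Python applied row by row. This is only one reading of the goal: step 3 is interpreted here as "the last 10 characters of g.FILENAME occur somewhere in d.FULL_FILE", and the names `match_row`, `MATCH`, and `MATCH_STEP` are hypothetical.

```python
import pandas as pd

def match_row(full_file, filename, d_full_files, tail=10):
    """Return (matching d.FULL_FILE or None, name of the step that matched)."""
    # Step 1: exact match on the full path.
    for cand in d_full_files:
        if cand == full_file:
            return cand, 'exact'
    # Step 2: the file name occurs somewhere inside the full path.
    for cand in d_full_files:
        if filename in cand:
            return cand, 'partial'
    # Step 3 (assumed interpretation): the last `tail` characters of the
    # file name occur somewhere inside the full path.
    suffix = filename[-tail:]
    for cand in d_full_files:
        if suffix in cand:
            return cand, 'tail'
    return None, 'none'

g = pd.DataFrame(
    [['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ['c:\\ythatFile.exe', 'thatFile.exe'],
     ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
    columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame(
    [['c:\\zthis_File.pdf', 'this_File.pdf'],
     ['c:\\zthatFile.exe', 'thatFile.exe'],
     ['c:\\ztheOtherFile.zip', 'ssss.zip']],
    columns=['FULL_FILE', 'FILENAME'])

d_paths = d.FULL_FILE.tolist()
results = [match_row(ff, fn, d_paths) for ff, fn in zip(g.FULL_FILE, g.FILENAME)]

# Attach the outcome to a copy of g, leaving the original frame untouched.
out = g.assign(MATCH=[r[0] for r in results],
               MATCH_STEP=[r[1] for r in results])
print(out)
```

The loop is quadratic in the number of rows, which is fine for small frames; the first hit wins when several paths in d contain the same name.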

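Please help. I have been researching this for hours and have found a few similar solutions, but none of them quite solves this problem, and I am having trouble adapting them.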

1 answer:

Answer 0: (score: 1)

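I don't fully understand the result you want, but here is my best shot at the first two steps you mentioned. (I didn't understand your comment about checking the last 10 characters, or how that relates to special characters, so I skipped that part.)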

In [80]: g
Out[80]:
              FULL_FILE          FILENAME
0      c:\ythisFile.pdf      thisFile.pdf
1      c:\ythatFile.exe      thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip

In [81]: d
Out[81]:
              FULL_FILE       FILENAME
0     c:\zthis_File.pdf  this_File.pdf
1      c:\zthatFile.exe   thatFile.exe
2  c:\ztheOtherFile.zip       ssss.zip

In [82]: temp1 = pd.merge(
    g, 
    d, 
    on='FULL_FILE', 
    how='left', 
    suffixes=('_g', '_d')
)

In [83]: temp1
Out[83]:
              FULL_FILE        FILENAME_g FILENAME_d
0      c:\ythisFile.pdf      thisFile.pdf        NaN
1      c:\ythatFile.exe      thatFile.exe        NaN
2  c:\ytheOtherFile.zip  theOtherFile.zip        NaN
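To see directly which rows of g found an exact match in this first step, one option is the `indicator` flag of `merge`, which adds a `_merge` column. This is a side sketch, not part of the answer's code; the names `step1` and `unmatched` are hypothetical.

```python
import pandas as pd

g = pd.DataFrame(
    [['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ['c:\\ythatFile.exe', 'thatFile.exe'],
     ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
    columns=['FULL_FILE', 'FILENAME'])
d = pd.DataFrame(
    [['c:\\zthis_File.pdf', 'this_File.pdf'],
     ['c:\\zthatFile.exe', 'thatFile.exe'],
     ['c:\\ztheOtherFile.zip', 'ssss.zip']],
    columns=['FULL_FILE', 'FILENAME'])

# Same left merge as above, with an extra `_merge` column that says
# whether each row was found in both frames or only in g.
step1 = pd.merge(g, d, on='FULL_FILE', how='left',
                 suffixes=('_g', '_d'), indicator=True)

# Rows that still need the partial-string step.
unmatched = step1[step1['_merge'] == 'left_only']
print(unmatched[['FULL_FILE', 'FILENAME_g']])
```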

In [84]: step2 = d.FULL_FILE.map(
    lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
        lambda y: y in x
    ).any()
)

In [85]: step2
Out[85]:
0    False
1     True
2     True
Name: FULL_FILE, dtype: bool

In [86]: temp2 = pd.merge(
    temp1, 
    d.loc[step2].drop('FULL_FILE', axis=1), 
    left_index=True, 
    right_index=True, 
    how='left'
)

In [87]: temp2
Out[87]:
              FULL_FILE        FILENAME_g FILENAME_d      FILENAME
0      c:\ythisFile.pdf      thisFile.pdf        NaN           NaN
1      c:\ythatFile.exe      thatFile.exe        NaN  thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip        NaN      ssss.zip                  

In [88]: temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)

In [89]: temp2.drop('FILENAME', axis=1)
Out[89]:
              FULL_FILE        FILENAME_g    FILENAME_d
0      c:\ythisFile.pdf      thisFile.pdf           NaN
1      c:\ythatFile.exe      thatFile.exe  thatFile.exe
2  c:\ytheOtherFile.zip  theOtherFile.zip      ssss.zip
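The index-aligned merge in In [86] pairs row i of g with row i of d, which works for this sample because matching files sit at the same position in both frames. If the rows could be ordered differently, one alternative is a cross join filtered by substring containment. The sketch below assumes pandas 1.2 or later for `how='cross'` and uses a deliberately reordered d; the names `pairs`, `hit`, and `result` are hypothetical.

```python
import pandas as pd

g = pd.DataFrame(
    [['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ['c:\\ythatFile.exe', 'thatFile.exe'],
     ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
    columns=['FULL_FILE', 'FILENAME'])
# Same rows as d in the example, but in a different order.
d = pd.DataFrame(
    [['c:\\ztheOtherFile.zip', 'ssss.zip'],
     ['c:\\zthis_File.pdf', 'this_File.pdf'],
     ['c:\\zthatFile.exe', 'thatFile.exe']],
    columns=['FULL_FILE', 'FILENAME'])

# Every row of g paired with every row of d.
pairs = g.merge(d, how='cross', suffixes=('_g', '_d'))

# Keep only the pairs where g's file name occurs inside d's full path.
hit = pd.Series(
    [fn in path for fn, path in zip(pairs.FILENAME_g, pairs.FULL_FILE_d)],
    index=pairs.index)
matches = pairs[hit]

# Left-join the hits back onto g so unmatched rows are kept with NaN.
result = g.merge(
    matches[['FULL_FILE_g', 'FILENAME_d']],
    left_on='FULL_FILE', right_on='FULL_FILE_g', how='left'
).drop(columns='FULL_FILE_g')
print(result)
```

The cross join holds len(g) × len(d) rows in memory, so this trades speed and space for not depending on row order.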

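Note that this also works when the first step does find a match. For example, if I wrap the steps in a function and add an entry like this to your sample data: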

In [135]: def fuzzy_match(g, d):
     ...:     temp1 = pd.merge(
     ...:         g,
     ...:         d,
     ...:         on='FULL_FILE',
     ...:         how='left',
     ...:         suffixes=('_g', '_d')
     ...:     )
     ...:     step2 = d.FULL_FILE.map(
     ...:         lambda x: temp1.loc[temp1.FILENAME_d.isnull()].FILENAME_g.map(
     ...:             lambda y: y in x
     ...:         ).any()
     ...:     )
     ...:     temp2 = pd.merge(
     ...:         temp1,
     ...:         d.loc[step2].drop('FULL_FILE', axis=1),
     ...:         left_index=True,
     ...:         right_index=True,
     ...:         how='left'
     ...:     )
     ...:     temp2['FILENAME_d'] = temp2['FILENAME_d'].fillna(temp2.FILENAME)
     ...:     return temp2.drop('FILENAME', axis=1)
     ...:
     ...:


In [136]: g = pd.DataFrame(
     ...:     [['c:\\aFile.txt', 'aFile.txt'],
     ...:      ['c:\\ythisFile.pdf', 'thisFile.pdf'],
     ...:      ['c:\\ythatFile.exe', 'thatFile.exe'],
     ...:      ['c:\\ytheOtherFile.zip', 'theOtherFile.zip']],
     ...:     columns=['FULL_FILE', 'FILENAME'])
     ...: d = pd.DataFrame(
     ...:     [['c:\\aFile.txt', 'aFile.txt'],
     ...:      ['c:\\zthis_File.pdf', 'this_File.pdf'],
     ...:      ['c:\\zthatFile.exe', 'thatFile.exe'],
     ...:      ['c:\\ztheOtherFile.zip', 'ssss.zip']],
     ...:     columns=['FULL_FILE', 'FILENAME'])

In [137]: g
Out[137]:
              FULL_FILE          FILENAME
0          c:\aFile.txt         aFile.txt
1      c:\ythisFile.pdf      thisFile.pdf
2      c:\ythatFile.exe      thatFile.exe
3  c:\ytheOtherFile.zip  theOtherFile.zip

In [138]: d
Out[138]:
              FULL_FILE       FILENAME
0          c:\aFile.txt      aFile.txt
1     c:\zthis_File.pdf  this_File.pdf
2      c:\zthatFile.exe   thatFile.exe
3  c:\ztheOtherFile.zip       ssss.zip

In [139]: fuzzy_match(g, d)
Out[139]:
              FULL_FILE        FILENAME_g    FILENAME_d
0          c:\aFile.txt         aFile.txt     aFile.txt
1      c:\ythisFile.pdf      thisFile.pdf           NaN
2      c:\ythatFile.exe      thatFile.exe  thatFile.exe
3  c:\ytheOtherFile.zip  theOtherFile.zip      ssss.zip