How to merge two pandas DataFrames based on a similarity function?

Time: 2016-02-13 14:10:44

Tags: python pandas merge fuzzy-comparison

Given dataset 1

name,x,y
st. peter,1,2
big university portland,3,4

and dataset 2

name,x,y
saint peter,3,4
uni portland,5,6

the goal is to merge

d1.merge(d2, on="name", how="left")

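even though the names do not match exactly. So I would like to do some kind of fuzzy matching. The particular technique does not matter here; the question is more about how to integrate it efficiently into pandas.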

For example, st. peter might match saint peter in the other dataset, but big university portland might deviate too much for us to match it with uni portland.

One way to think about it is to join on the lowest Levenshtein distance, but only if that distance is below 5 edits (st. --> saint is 4).
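
To make the threshold concrete, here is a minimal pure-Python sketch of the Levenshtein distance. The function name levenshtein is only illustrative and does not come from any library; a dedicated package would likely be faster on large inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far
    # (initially the empty string) and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i characters of a to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("st. peter", "saint peter"))  # 4, as stated above
```

With the "below 5 edits" rule, st. peter would be joined to saint peter, while big university portland and uni portland (distance 11) would not.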

The resulting dataframe should contain only the row st. peter, with both "name" variations as well as both sets of x and y variables.

Is there a way to do this kind of merge using pandas?

3 answers:

Answer 0 (score: 2)

Have you looked at fuzzywuzzy?

You could do something like this:

import pandas as pd
import fuzzywuzzy.process as fwp

# Candidate names to match against, taken from the second frame.
choices = list(df2['name'])

def fmatch(row):
    minscore = 95  # or whatever score works for you
    # Use row['name'] here: row.name would return the row's index label.
    choice, score = fwp.extractOne(row['name'], choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch, axis=1)
merged = pd.merge(df1,
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1', '_df2'],
                  how='outer')  # assuming you want to keep unmatched records

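Caveat emptor: I haven't tried this.

Answer 1 (score: 1)

The simplest idea I can come up with right now is to build a special dataframe holding the distances between all pairs of names: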

>>> import pandas as pd
>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
                     name1         name2  x2  y2
0                st. peter   saint peter   3   4
1                st. peter  uni portland   5   6
2  big university portland   saint peter   3   4
3  big university portland  uni portland   5   6

>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
                     name1         name2  x2  y2  res
0                st. peter   saint peter   3   4    4
1                st. peter  uni portland   5   6    9
2  big university portland   saint peter   3   4   18
3  big university portland  uni portland   5   6   11
>>> merger = merger[merger['res'] <= 5]
>>> merger
       name1        name2  x2  y2  res
0  st. peter  saint peter   3   4    4

>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
                      name  x  y      name1        name2  x2  y2
0                st. peter  1  2  st. peter  saint peter   3   4
1  big university portland  3  4        NaN          NaN NaN NaN

Answer 2 (score: 0)

Let's say you have a function that returns the best match if there is one, and None otherwise:

def best_match(s, candidates):
    ''' Return the item in candidates that best matches s.

    Will return None if a good enough match is not found.
    '''
    # Some code here.
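
One possible way to fill in that body, as a minimal sketch rather than the answer's intended implementation, is to lean on the standard library's difflib. The cutoff parameter and its default of 0.6 are assumptions added here for illustration and would need tuning for real data.

```python
import difflib


def best_match(s, candidates, cutoff=0.6):
    """Return the item in candidates that best matches s, or None.

    cutoff is a hypothetical tuning knob (a similarity ratio between 0 and 1);
    candidates scoring below it are not considered good enough.
    """
    # get_close_matches returns up to n candidates, best first,
    # keeping only those whose similarity ratio is at least cutoff.
    matches = difflib.get_close_matches(s, list(candidates), n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(best_match("st. peter", ["saint peter", "uni portland"]))
```

Note that difflib scores similarity as a ratio, not as an edit count, so a "below 5 edits" rule would need a Levenshtein-based implementation instead.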

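You can then join on the value it returns, but there are different ways to do that, and they lead to different output (or so I think; I haven't looked at this problem very closely):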

(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
 .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name']))),
           on='name', how='left'))