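How do I merge pandas DataFrames on imperfect matches?

Time: 2021-03-03 20:34:01

Tags: pandas dataframe

I am trying to merge/join the x and y dataframes based on an exact match of the company columns and a partial match, to some degree, of the name columns.

Other than looking at the values returned by SequenceMatcher(None, x_name, y_name).ratio(), which in my case were always 0.8 or higher for the pairs that should match, I haven't tried much that is worth mentioning.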
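For reference, here is a minimal sketch of how those ratios can be inspected for the sample names, using only the standard library. The variable names `x_names`, `y_names` and `scores` are just for illustration.

```python
from difflib import SequenceMatcher

x_names = ['Robert Jackson', 'William Johnson']
y_names = ['Bob Jackson', 'Willy Johnson']

# Score every x name against every y name; ratio() returns a float in [0, 1].
scores = {
    (x_name, y_name): SequenceMatcher(None, x_name, y_name).ratio()
    for x_name in x_names
    for y_name in y_names
}

for (x_name, y_name), score in scores.items():
    print(f'{x_name!r} vs {y_name!r}: {score:.2f}')
```

The intended pairs should score at or above 0.8, while the crossed pairs should land well below it.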

x = pd.DataFrame([{'id': 1, 'name': 'Robert Jackson', 'company': 'Test inc.', 'tenure': 6},
                  {'id': 2, 'name': 'William Johnson', 'company': 'Test inc.', 'tenure': 6}]).set_index('id')
y = pd.DataFrame([{'id': 4, 'name': 'Bob Jackson', 'company': 'Test inc.', 'job': 'desk'},
                  {'id': 5, 'name': 'Willy Johnson', 'company': 'Test inc.', 'job': 'desk'}]).set_index('id')

goal = pd.DataFrame([{'x_id': 1, 'y_id': 4, 'x_name': 'Robert Jackson', 'y_name': 'Bob Jackson', 'company': 'Test inc.', 'tenure': 6, 'job': 'desk'},
                     {'x_id': 2, 'y_id': 5, 'x_name': 'William Johnson', 'y_name': 'Willy Johnson', 'company': 'Test inc.', 'tenure': 6, 'job': 'desk'}])

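Is something like this feasible? Thanks in advance for any feedback.

2 Answers:

Answer 0: (score: 1)

Good question! I'm watching for other answers, because I have been doing a lot of similar work lately. One (inefficient) approach I have used is fuzzywuzzy with a score threshold: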

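from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=1):
    # Candidate strings to match against, taken from the second frame.
    s = df_2[key2].tolist()
    # For each value in df_1[key1], get the `limit` best (match, score) pairs.
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only matches scoring at least `threshold` (0-100), joined into one string.
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    # Note: df_1 is modified in place and also returned.
    return df_1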
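If adding a dependency is not an option, the same "best match above a threshold" idea can be sketched with the standard library's difflib.get_close_matches. This is an assumption-laden stand-in, not the fuzzywuzzy scorer: the helper name `closest_match` is hypothetical, and the cutoff is on difflib's 0 to 1 scale rather than fuzzywuzzy's 0 to 100.

```python
from difflib import get_close_matches

def closest_match(name, candidates, cutoff=0.8):
    """Return the best candidate scoring at least `cutoff`, or None."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

y_names = ['Bob Jackson', 'Willy Johnson']
print(closest_match('William Johnson', y_names))
print(closest_match('Someone Else', y_names))
```

A helper like this could be applied to a column with Series.apply in the same way `process.extract` is used above.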

Answer 1: (score: 1)

The solution I ended up using is:

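from difflib import SequenceMatcher
import pandas as pd

# Copy the key columns of x into helper columns used for the final merge.
x['merge_name'] = x['name']
x['merge_comp'] = x['company']

for a, b in x[['name', 'company']].values:
    # Walk y by its own index labels: x and y are indexed by 'id',
    # so positions from enumerate() would not match the labels .loc expects.
    for iy, (c, d) in zip(y.index, y[['name', 'company']].values):
        if SequenceMatcher(None, a, c).ratio() >= .8:   # fuzzy match on name
            y.loc[iy, 'merge_name'] = a
        if SequenceMatcher(None, b, d).ratio() == 1:    # exact match on company
            y.loc[iy, 'merge_comp'] = b

goal = pd.merge(x, y, on=['merge_name', 'merge_comp'])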

This function works when passing any number of columns:

def sm_merge(df1, df2, columns=(), ratios=(), prefix='m_', reset_index=False, post_drop=True):
    # Work on copies so the caller's frames are not modified.
    df1 = df1.reset_index() if reset_index else df1.copy()
    df2 = df2.reset_index() if reset_index else df2.copy()
    # One helper merge column per key column, e.g. 'm_name', 'm_company'.
    merge_columns = [prefix + col for col in columns]
    for col, m_col in zip(columns, merge_columns):
        df1[m_col] = df1[col]
    for col, m_col, ratio in zip(columns, merge_columns, ratios):
        for val_1 in df1[col].values:
            # Use df2's own index labels so .loc also works without reset_index.
            for index, val_2 in zip(df2.index, df2[col].values):
                if SequenceMatcher(None, str(val_1), str(val_2)).ratio() >= ratio:
                    df2.loc[index, m_col] = val_1
    df = pd.merge(df1, df2, on=merge_columns)
    if post_drop:
        # Remove the helper columns from the merged result.
        df.drop(columns=merge_columns, inplace=True)
    return df

sm_merge(x, y, columns=['name', 'company'], ratios=[.8, 1], reset_index=True)

This function works when passing exactly 2 columns/ratios:

def sm_merge(df1, df2, columns=(), ratios=(), prefix='m_', reset_index=True, post_drop=True):
    # Work on copies so the caller's frames are not modified.
    df1_c = df1.copy()
    df2_c = df2.copy()
    if reset_index:
        df1_c.reset_index(inplace=True)
        df2_c.reset_index(inplace=True)
    df1_c[prefix + columns[0]] = df1_c[columns[0]]
    df1_c[prefix + columns[1]] = df1_c[columns[1]]
    merge_columns = [prefix + columns[0], prefix + columns[1]]
    for col_1, col_2 in df1_c[[columns[0], columns[1]]].values:
        # Pair each df2 row with its index label so .loc also works when reset_index=False.
        for index, (col_3, col_4) in zip(df2_c.index, df2_c[[columns[0], columns[1]]].values):
            if SequenceMatcher(None, str(col_1), str(col_3)).ratio() >= ratios[0]:
                df2_c.loc[index, merge_columns[0]] = col_1
            if SequenceMatcher(None, str(col_2), str(col_4)).ratio() >= ratios[1]:
                df2_c.loc[index, merge_columns[1]] = col_2
    df = pd.merge(df1_c, df2_c, on=merge_columns)
    if post_drop:
        df.drop(columns=merge_columns, inplace=True)
    return df

sm_merge(x, y, columns=['name', 'company'], ratios=[.8, 1])
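
One thing to keep in mind with the loops above is that, when several rows clear the threshold, whichever match is written last wins. A possible variation is to keep only the highest-scoring candidate within the same company. The sketch below shows that pairing logic in plain Python on lists of dicts rather than DataFrames; `best_match_pairs`, `x_rows` and `y_rows` are hypothetical names, and the rows are the sample data from the question.

```python
from difflib import SequenceMatcher

def best_match_pairs(x_rows, y_rows, name_ratio=0.8):
    """Pair each x row with its highest-scoring y row in the same company."""
    pairs = []
    for x_row in x_rows:
        best, best_score = None, name_ratio
        for y_row in y_rows:
            if x_row['company'] != y_row['company']:
                continue  # company must match exactly
            score = SequenceMatcher(None, x_row['name'], y_row['name']).ratio()
            if score >= best_score:  # on ties, the later row wins
                best, best_score = y_row, score
        if best is not None:
            pairs.append({
                'x_id': x_row['id'], 'y_id': best['id'],
                'x_name': x_row['name'], 'y_name': best['name'],
                'company': x_row['company'],
                'tenure': x_row['tenure'], 'job': best['job'],
            })
    return pairs

x_rows = [{'id': 1, 'name': 'Robert Jackson', 'company': 'Test inc.', 'tenure': 6},
          {'id': 2, 'name': 'William Johnson', 'company': 'Test inc.', 'tenure': 6}]
y_rows = [{'id': 4, 'name': 'Bob Jackson', 'company': 'Test inc.', 'job': 'desk'},
          {'id': 5, 'name': 'Willy Johnson', 'company': 'Test inc.', 'job': 'desk'}]

pairs = best_match_pairs(x_rows, y_rows)
```

Passing `pairs` to pd.DataFrame should give a frame with the same columns as goal. Like the functions above, this still compares every x row with every y row, so it is only practical for small inputs.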