Question

我有两个数据帧DF（~100k行），它是原始数据文件和DF1（15k行），映射文件。我正在尝试将DF.address和DF.Name列与DF1.Address和DF1.Name匹配。找到匹配后，DF1.ID应填入DF.ID（如果DF1.ID不是None），否则DF1.top_ID应填入DF.ID。

我能够在模糊逻辑的帮助下匹配地址和名称，但我坚持如何连接获得的结果来填充ID。

DF1-映射文件

DF原始数据文件

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from operator import itemgetter




df=pd.read_excel("Test1", index=False)
df1=pd.read_excel("Test2", index=False)


df=df[df['ID'].isnull()]
zip_code=df['Zip'].tolist()
Facility_city=df['City'].tolist()
Address=df['Address'].tolist()
Name_list=df['Name'].tolist()


def fuzzy_match(x, choice, scorer, cutoff):
    return (process.extractOne(x, 
                               choices=choice, 
                               scorer=scorer, 
                               score_cutoff=cutoff))

for pin,city,Add,Name in zip(zip_code,Facility_city,Address,Name_list):
        #====Address Matching=====#
        choice=df1.loc[(df1['Zip']==pin) &(df1['City']==city),'Address1']
        result=fuzzy_match(Add,choice,fuzz.ratio,70)
        #====Name Matching========#
        if (result is not None):
            if (result[3]>70):
                choice_1=(df1.loc[(df1['Zip']==pin) &(df1['City']==city),'Name'])
                result_1=(fuzzy_match(Name,choice_1,fuzz.ratio,95))
                print(ID)
                if (result_1 is not None):
                    if(result_1[3]>95):
                       #Here populating the matching ID
                        print("ok")       


                    else:
                        continue
                else:
                    continue
            else:
                continue
        else:

Answer 1

IIUC：这是一个解决方案：

from fuzzywuzzy import fuzz
import pandas as pd

#Read raw data from clipboard
raw = pd.read_clipboard()

#Read map data from clipboard
mp = pd.read_clipboard()

#Merge raw data and mp data as following 
dfr = mp.merge(raw, on=['Hospital Name', 'City', 'Pincode'], how='outer')

#dfr will have many duplicate rows - eliminate duplicate
#To eliminate duplicate using toke_sort_ratio, compare address x and y
dfr['SCORE'] = dfr.apply(lambda x: fuzz.token_sort_ratio(x['Address_x'], x['Address_y']), axis=1)

#Filter only max ratio rows grouped by Address_x
dfr1 = dfr.iloc[dfr.groupby('Address_x').apply(lambda x: x['SCORE'].idxmax())]
#dfr1 shall have the desired result

此link包含测试所提供解决方案的示例数据。

excel数据的模糊逻辑-Pandas

1 个答案: