excel数据的模糊逻辑-Pandas

时间:2018-03-27 07:50:06

标签: python-3.x pandas fuzzy

我有两个数据帧DF(~100k行),它是原始数据文件和DF1(15k行),映射文件。我正在尝试将DF.address和DF.Name列与DF1.Address和DF1.Name匹配。找到匹配后,DF1.ID应填入DF.ID(如果DF1.ID不是None),否则DF1.top_ID应填入DF.ID。

我能够在模糊逻辑的帮助下匹配地址和名称,但我坚持如何连接获得的结果来填充ID。

DF1-映射文件

DF原始数据文件

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from operator import itemgetter




df=pd.read_excel("Test1", index=False)
df1=pd.read_excel("Test2", index=False)


df=df[df['ID'].isnull()]
zip_code=df['Zip'].tolist()
Facility_city=df['City'].tolist()
Address=df['Address'].tolist()
Name_list=df['Name'].tolist()


def fuzzy_match(x, choice, scorer, cutoff):
    return (process.extractOne(x, 
                               choices=choice, 
                               scorer=scorer, 
                               score_cutoff=cutoff))

for pin,city,Add,Name in zip(zip_code,Facility_city,Address,Name_list):
        #====Address Matching=====#
        choice=df1.loc[(df1['Zip']==pin) &(df1['City']==city),'Address1']
        result=fuzzy_match(Add,choice,fuzz.ratio,70)
        #====Name Matching========#
        if (result is not None):
            if (result[3]>70):
                choice_1=(df1.loc[(df1['Zip']==pin) &(df1['City']==city),'Name'])
                result_1=(fuzzy_match(Name,choice_1,fuzz.ratio,95))
                print(ID)
                if (result_1 is not None):
                    if(result_1[3]>95):
                       #Here populating the matching ID
                        print("ok")       


                    else:
                        continue
                else:
                    continue
            else:
                continue
        else:

1 个答案:

答案 0 :(得分:1)

IIUC:这是一个解决方案:

from fuzzywuzzy import fuzz
import pandas as pd

#Read raw data from clipboard
raw = pd.read_clipboard()

#Read map data from clipboard
mp = pd.read_clipboard()

#Merge raw data and mp data as following 
dfr = mp.merge(raw, on=['Hospital Name', 'City', 'Pincode'], how='outer')

#dfr will have many duplicate rows - eliminate duplicate
#To eliminate duplicate using toke_sort_ratio, compare address x and y
dfr['SCORE'] = dfr.apply(lambda x: fuzz.token_sort_ratio(x['Address_x'], x['Address_y']), axis=1)

#Filter only max ratio rows grouped by Address_x
dfr1 = dfr.iloc[dfr.groupby('Address_x').apply(lambda x: x['SCORE'].idxmax())]
#dfr1 shall have the desired result

link包含测试所提供解决方案的示例数据。