Fuzzy matching between 2 columns (Python)

Time: 2016-10-20 00:36:58

Tags: python python-3.x pandas fuzzywuzzy

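I have a pandas DataFrame named "df_combo" with the columns "worker_id", "url_entrance", and "company_name". I am trying to produce an output column that tells me whether the URL in "url_entrance" contains any of the words in "company_name". Even a close match, such as a fuzzy match, would work.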

For example, if the URL is "www.grandhotelseattle.com" and "company_name" is "Hotel Prestige Seattle", the fuzzy ratio might be somewhere between 70 and 80.
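To make that target concrete, here is a minimal sketch of scoring this single pair. The score you actually get depends heavily on case, the "www." prefix, the ".com" suffix, and spaces, so the sketch compares the raw strings against a normalized version. The helpers `normalize_url` and `normalize_name` are hypothetical names introduced only for illustration, and the `difflib` fallback is an assumption added so the snippet still runs when fuzzywuzzy is not installed.

```python
try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Fallback (assumption): approximate a 0-100 score with the standard library.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def normalize_url(url):
    """Hypothetical helper: lowercase, drop a leading 'www.' and the last dot-suffix."""
    host = url.lower()
    if host.startswith("www."):
        host = host[4:]
    return host.rsplit(".", 1)[0]


def normalize_name(name):
    """Hypothetical helper: lowercase and remove spaces so the name resembles a hostname."""
    return name.lower().replace(" ", "")


url = "www.grandhotelseattle.com"
company = "Hotel Prestige Seattle"

# Score the strings as they are, then again after normalization.
raw_score = ratio(url, company)
clean_score = ratio(normalize_url(url), normalize_name(company))
print(raw_score, clean_score)
```

Comparing the two printed numbers on a handful of real rows is a quick way to decide whether normalization is worth doing before picking a threshold.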

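I tried the following script:

>>> fuzz.ratio(df_combo['url_entrance'], df_combo['company_name'])

but it returns only one number, which is the overall fuzzy ratio for the entire columns. I want a fuzzy ratio for each row, with those ratios stored in a new column.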
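What is being asked for here is a per-row score. One way to sketch that, assuming both columns already hold plain strings, is to call `fuzz.ratio` once per row through `DataFrame.apply` with `axis=1`. The sample rows and the output column name `fuzz_ratio` are made up for illustration, and the `difflib` fallback is an assumption so the snippet runs without fuzzywuzzy.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Fallback (assumption): approximate a 0-100 score with the standard library.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Illustrative data only; the real df_combo comes from the asker's own source.
df_combo = pd.DataFrame({
    "worker_id": [1, 2],
    "url_entrance": ["www.grandhotelseattle.com", "www.example.com"],
    "company_name": ["Hotel Prestige Seattle", "Example"],
})

# axis=1 hands each row to the lambda, so the two strings are compared pairwise.
df_combo["fuzz_ratio"] = df_combo.apply(
    lambda row: ratio(row["url_entrance"], row["company_name"]),
    axis=1,
)
print(df_combo)
```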

1 Answer:

Answer 0: (score: 2)

Thanks everyone for your input. I have solved my problem! The link that "agg3l" provided was helpful. The "TypeError" I saw occurred because "url_entrance" or "company_name" contained float values in certain rows. I converted both columns to strings using the following statements, re-ran the fuzz.ratio script, and got it to work!

df_combo['url_entrance'] = df_combo['url_entrance'].astype(str)
df_combo['company_name'] = df_combo['company_name'].astype(str)
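For anyone hitting the same error, here is a rough end-to-end sketch: find the rows that hold non-string values, convert both columns, then score each row. It assumes the float values were NaN from missing data, which the answer above does not actually state. It also deviates from the answer in one respect: it calls `fillna("")` before `astype(str)`, because `astype(str)` on its own typically turns NaN into the literal text "nan", which would then be scored like any other string. The sample data and the `fuzz_ratio` column name are illustrative, and the `difflib` fallback is an assumption so the snippet runs without fuzzywuzzy.

```python
import numpy as np
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Fallback (assumption): approximate a 0-100 score with the standard library.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Illustrative data with missing values (NaN is a float).
df_combo = pd.DataFrame({
    "worker_id": [1, 2, 3],
    "url_entrance": ["www.grandhotelseattle.com", np.nan, "www.example.com"],
    "company_name": ["Hotel Prestige Seattle", "Example Co", np.nan],
})
cols = ["url_entrance", "company_name"]

# Diagnose: mark every cell that is a real Python string, then list the offending rows.
is_str = df_combo[cols].apply(lambda col: col.map(lambda v: isinstance(v, str)))
bad_rows = df_combo[~is_str.all(axis=1)]
print(bad_rows)

# Convert: missing values become empty strings, everything else becomes str.
df_combo[cols] = df_combo[cols].fillna("").astype(str)

# Score each row pairwise and store the result in a new column.
df_combo["fuzz_ratio"] = [
    ratio(u, c)
    for u, c in zip(df_combo["url_entrance"], df_combo["company_name"])
]
print(df_combo)
```

Rows where one side is empty score 0 here, which makes them easy to filter out afterwards.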