
Fuzzy matching between 2 columns (Python)

Time: 2016-10-20 00:36:58

Tags: python python-3.x pandas fuzzywuzzy

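I have a pandas DataFrame called "df_combo" which contains the columns "worker_id", "url_entrance", and "company_name". I am trying to produce an output column that tells me whether the URL in the "url_entrance" column contains any word from the "company_name" column. Even a close match, like the ones fuzzywuzzy produces, would work.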

For example, if the URL is "www.grandhotelseattle.com" and the "company_name" is "Hotel Prestige Seattle", then the fuzz ratio might be somewhere between 70 and 80.

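I have tried the following script:

>>> fuzz.ratio(df_combo['url_entrance'], df_combo['company_name'])

But it returns only 1 number, which is the overall fuzz ratio for the whole column. I would like a fuzz ratio for every row, with those ratios stored in a new column.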

1 answer:

Answer 0: (score: 2)

Thanks everyone for your input. I have solved my problem! The link that "agg3l" provided was helpful. The "TypeError" I saw occurred because "url_entrance" or "company_name" contained float values in certain rows. I converted both columns to strings using the following lines, re-ran the fuzz.ratio script, and got it to work!

df_combo['url_entrance'] = df_combo['url_entrance'].astype(str)
df_combo['company_name'] = df_combo['company_name'].astype(str)
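
For anyone who wants the whole flow in one place, here is a minimal sketch of one way to get a per-row ratio into a new column. It is not necessarily the exact script used above. The sample rows, the "fuzz_ratio" column name, and the fillna("") step are assumptions added for illustration; with plain astype(str), a missing value would typically be compared as the literal text "nan". The difflib fallback is only a rough stand-in for fuzz.ratio when fuzzywuzzy is not installed, and its scores may differ slightly.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Assumption: approximate fuzz.ratio with the standard library
    # when fuzzywuzzy is unavailable. Scores may not match exactly.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Hypothetical sample data; float NaN values mimic the rows that
# triggered the TypeError described in the answer.
df_combo = pd.DataFrame({
    "worker_id": [1, 2, 3],
    "url_entrance": ["www.grandhotelseattle.com", "www.example.com", float("nan")],
    "company_name": ["Hotel Prestige Seattle", float("nan"), "Example Inc"],
})

# Make both columns pure strings. fillna("") is an added step so that
# missing values become empty strings before the comparison.
for col in ["url_entrance", "company_name"]:
    df_combo[col] = df_combo[col].fillna("").astype(str)

# Compute the ratio row by row instead of passing whole columns.
df_combo["fuzz_ratio"] = df_combo.apply(
    lambda row: ratio(row["url_entrance"], row["company_name"]),
    axis=1,
)

print(df_combo)
```

If the goal is to detect a company word inside a longer URL, fuzz.partial_ratio or lowercasing both columns first may give more useful scores than fuzz.ratio, though that depends on the data.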