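I have a pandas DataFrame named "df_combo" with the columns "worker_id", "url_entrance", and "company_name". I am trying to produce an output column that tells me whether the URL in the "url_entrance" column contains any word from the "company_name" column. Even a close match, such as a fuzzy match, would work.
For example, if the URL is "www.grandhotelseattle.com" and the "company_name" is "Hotel Prestige Seattle", the fuzzy ratio might be somewhere between 70 and 80.
I tried the following script: >>> fuzz.ratio(df_combo['url_entrance'], df_combo['company_name']) But it returns only a single number, which is the overall fuzzy ratio for the whole column. I would like a fuzzy ratio for each row, with those ratios stored in a new column.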
Answer 0 (score: 2)
Thanks, everyone, for your input. I have solved my problem! The link that "agg3l" provided was helpful. The "TypeError" I saw occurred because the "url_entrance" or "company_name" column contained float values in some rows. I converted both columns to strings with the following statements, re-ran the fuzz.ratio script, and got it to work!
df_combo['url_entrance'] = df_combo['url_entrance'].astype(str)
df_combo['company_name'] = df_combo['company_name'].astype(str)
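For anyone landing here later, a minimal sketch of one way to get a per-row score may help, since the answer above does not show the row-wise call itself. It applies the scoring function across rows with DataFrame.apply(axis=1). The sample data is hypothetical, the "fuzz_ratio" column name is my own choice, and the fillna("") step is an addition of mine so that missing values become empty strings instead of the text "nan". If fuzzywuzzy is not installed, the sketch falls back to a difflib-based stand-in, which is an assumption and may not score identically to fuzz.ratio.

```python
from difflib import SequenceMatcher

import pandas as pd

try:
    # The library used in the question
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Stand-in (assumption): a 0-100 similarity score from the standard library
    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Hypothetical sample data; the None mimics a row with a missing value
df_combo = pd.DataFrame({
    "worker_id": [1, 2],
    "url_entrance": ["www.grandhotelseattle.com", "www.example.com"],
    "company_name": ["Hotel Prestige Seattle", None],
})

# Make sure every cell is a string before comparing.
# fillna("") is an addition here: missing values become empty strings.
for col in ["url_entrance", "company_name"]:
    df_combo[col] = df_combo[col].fillna("").astype(str)

# Score each row separately and store the result in a new column
df_combo["fuzz_ratio"] = df_combo.apply(
    lambda row: ratio(row["url_entrance"], row["company_name"]), axis=1
)

print(df_combo)
```

Passing the two whole columns to fuzz.ratio compares them as single objects, which is why the original attempt produced only one number; apply with axis=1 calls the function once per row instead.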