Based on this link, I am trying to do a fuzzy lookup between two DataFrames: Apply fuzzy matching across a dataframe column and save results in a new column:
import pandas as pd
df1 = pd.DataFrame(data={'Brand_var':['Johnny Walker','Guiness','Smirnoff','Vat 69','Tanqueray']})
df2 = pd.DataFrame(data={'Product':['J.Walker Blue Label 12 CC','J.Morgan Blue Walker','Giness blue 150 CC','tqry qiuyur qtre','v69 g nesscom ui123']})
I have two DataFrames, df1 and df2, that need to be mapped to each other via fuzzy lookup or any other suitable method.
This is the code I am using:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
compare = pd.MultiIndex.from_product([df1['Brand_var'],
                                      df2['Product']]).to_series()
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
df = compare.apply(metrics).unstack().idxmax().unstack(0)
print(df)
This is my output:
                           ratio          token
----------------------------------------------------------
Giness blue 150 CC         Guiness        Guiness
J.Morgan Blue Walker       Johnny Walker  Johnny Walker
J.Walker Blue Label 12 CC  Johnny Walker  Johnny Walker
tqry qiuyur qtre           Tanqueray      Tanqueray
v69 g nesscom ui123        Guiness        Guiness
Expected output:
                           ratio          token
----------------------------------------------------------
Giness blue 150 CC         Guiness        Guiness
J.Morgan Blue Walker       None           None
J.Walker Blue Label 12 CC  Johnny Walker  Johnny Walker
tqry qiuyur qtre           Tanqueray      Tanqueray
v69 g nesscom ui123        Vat 69         Vat 69
Any suggestions for a better way to get my desired output? (An approach that does not use fuzzywuzzy is fine too.)
Thanks in advance. :)
Answer 0 (score: 1)
The following rule-based code will give you the expected output:
import pandas as pd
df1 = pd.DataFrame(data={'Brand_var':['Johnny Walker','Guiness','Smirnoff','Vat 69','Tanqueray']})
df2 = pd.DataFrame(data={'Product':['J.Walker Blue Label 12 CC','J.Morgan Blue Walker','Giness blue 150 CC','tqry qiuyur qtre','v69 g nesscom ui123']})
# Keyword lists: substrings that identify each brand inside a product name
Guiness_Beer = ["Giness", "Guiness", "Gines"]
Johnny_Walker = ["J.Walker", "J.walker"]
Tanqueray = ["tqry", "Tanqueray", "tquery"]
Vat = ["69", "Vat69", "Vat 69"]
matched_names = []
# Iterate over the products in df2, since those are the rows being labeled
for row in df2.index:
    # DataFrame.get_value was removed in pandas 1.0; .at is its replacement
    brand_name = df2.at[row, "Product"]
    Rule_Guiness = any(word in brand_name for word in Guiness_Beer)
    Rule_Johnny_Walker = any(word in brand_name for word in Johnny_Walker)
    Rule_Tanqueray = any(word in brand_name for word in Tanqueray)
    Rule_Vat = any(word in brand_name for word in Vat)
    # The first rule that fires wins; products matching no rule get "None"
    if Rule_Guiness:
        matched_names.append([brand_name, "Guiness"])
    elif Rule_Johnny_Walker:
        matched_names.append([brand_name, "Johnny Walker"])
    elif Rule_Tanqueray:
        matched_names.append([brand_name, "Tanqueray"])
    elif Rule_Vat:
        matched_names.append([brand_name, "Vat 69"])
    else:
        matched_names.append([brand_name, "None"])
df = pd.DataFrame(columns=['Product', 'Brand'], data=matched_names)
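You can extend this further. For example, you could configure all the keyword lists (Guiness_Beer and so on) in an Excel file, so that you never have to touch the code later when you want to add, remove, or modify a keyword.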
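As a rough sketch of that configuration-driven idea, the keyword lists can live in a single mapping that the matching logic loops over, so adding a brand means adding one entry rather than another elif branch. The names BRAND_KEYWORDS and match_brand are hypothetical and not part of the answer above, and the mapping is hard-coded here only for illustration; it could just as well be built from an Excel or CSV file.

```python
# Hypothetical keyword table: brand -> substrings that identify it.
# Hard-coded for illustration; it could be loaded from a spreadsheet instead.
BRAND_KEYWORDS = {
    "Guiness": ["Giness", "Guiness", "Gines"],
    "Johnny Walker": ["J.Walker", "J.walker"],
    "Tanqueray": ["tqry", "Tanqueray", "tquery"],
    "Vat 69": ["69", "Vat69", "Vat 69"],
}


def match_brand(product, keywords=BRAND_KEYWORDS):
    """Return the first brand whose keyword occurs in `product`, else None."""
    # Dicts preserve insertion order (Python 3.7+), so earlier entries
    # take priority, mirroring the if/elif chain.
    for brand, words in keywords.items():
        if any(word in product for word in words):
            return brand
    return None


products = [
    "J.Walker Blue Label 12 CC",
    "J.Morgan Blue Walker",
    "Giness blue 150 CC",
    "tqry qiuyur qtre",
    "v69 g nesscom ui123",
]
for product in products:
    print(product, "->", match_brand(product))
```

With pandas, the same function could be applied to the whole column via df2["Brand"] = df2["Product"].map(match_brand). Note that the matching is plain case-sensitive substring search, so a short keyword such as "69" will also fire on unrelated products that happen to contain it.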