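I compared a list of company names against a master list and found the closest match for each name using two methods: n-grams with cosine similarity, and Levenshtein distance. That gave me the data below, where I manually verified which closest matches are correct and which are not (0 = incorrect, 1 = correct):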
import pandas as pd
data = {
    'Name_Raw': ['AECOM TECHNICAL SERVICES', 'AECOM_*', 'AECOM- Amentum', 'AECOM GOVERNMENT SERVICES (Inactive)', 'ADT LLC dba ADT Security Services', 'ADT', 'AAA Call Center', 'AAA of Northern California, Nevada', 'ANHEUSER BUSCH InBev'],
    'Name_CleanCorrect': ['AECOM', 'AECOM', 'AECOM', 'AECOM', 'ADT SECURITY CORPORATION', 'ADT SECURITY CORPORATION', 'AAA', 'AAA', 'AB InBev'],
    'Name_ngram': ['AECOM', 'AECOM', 'AECOM', 'AECOM', 'ADT SECURITY CORPORATION', 'ADT SECURITY CORPORATION', 'AAA', 'State Bar of California', 'Ivanhoe Cambridge USA'],
    'ngram similarity': [0.38, 1, 0.51, 0.33, 0.64, 0.41, 0.36, 0.30, 0.16],
    'Name_Leven': ['AECOM', 'AECOM', 'AECOM', 'AECOM', 'ADT SECURITY CORPORATION', 'ADT SECURITY CORPORATION', 'AAA', 'State Bar of California', 'AB InBev'],
    'leven_similarity': [0.23, 1, 1, 0.21, 0.65, 0.85, 0.85, 0.37, 0.65],
    'ngram_correct': [1, 1, 1, 1, 1, 1, 1, 0, 0],
    'leven_correct': [1, 1, 1, 1, 1, 1, 1, 0, 1],
}
df2 = pd.DataFrame(data)
print(df2)
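How would I calculate, for each method separately, the similarity threshold that yields no more than 10% false negatives? For example, a threshold of 70% for n-grams/cosine similarity versus 90% for Levenshtein distance/fuzzy matching?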
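One possible way to frame this, as a minimal sketch rather than a definitive answer: assume a false negative is a correct match (label 1) whose similarity falls below the threshold and is therefore rejected, and that the false negative rate is measured against the number of correct matches. Under that assumption, the highest threshold allowed is the smallest score among the correct matches that must be kept. The function name `max_threshold_for_fn_rate` and the `max_fn_rate` parameter are hypothetical; the score and label lists are copied from the columns above.

```python
def max_threshold_for_fn_rate(scores, correct, max_fn_rate=0.10):
    """Return the highest threshold t such that at most `max_fn_rate` of the
    correct matches have a score below t (and would be rejected).

    Assumption: a match is accepted when score >= t.
    """
    # Scores of the verified-correct matches, lowest first
    positives = sorted(s for s, c in zip(scores, correct) if c == 1)
    if not positives:
        raise ValueError("no correct matches to calibrate on")
    # How many correct matches we can afford to lose; the small epsilon
    # guards against floating-point error in the product
    allowed = int(max_fn_rate * len(positives) + 1e-9)
    # Setting t to the (allowed + 1)-th smallest score rejects at most
    # `allowed` correct matches; any higher t would reject more
    return positives[allowed]


# Same values as the 'ngram similarity' / 'ngram_correct' columns
ngram_scores = [0.38, 1, 0.51, 0.33, 0.64, 0.41, 0.36, 0.30, 0.16]
ngram_correct = [1, 1, 1, 1, 1, 1, 1, 0, 0]

# Same values as the 'leven_similarity' / 'leven_correct' columns
leven_scores = [0.23, 1, 1, 0.21, 0.65, 0.85, 0.85, 0.37, 0.65]
leven_correct = [1, 1, 1, 1, 1, 1, 1, 0, 1]

ngram_t = max_threshold_for_fn_rate(ngram_scores, ngram_correct)
leven_t = max_threshold_for_fn_rate(leven_scores, leven_correct)
print("ngram threshold:", ngram_t)
print("levenshtein threshold:", leven_t)
```

With only 7 and 8 correct matches in this sample, a 10% budget allows zero false negatives, so each threshold collapses to the lowest score among the correct matches. On a larger verified sample the same function would drop the lowest-scoring 10% of correct matches. The function also accepts pandas columns directly, e.g. `max_threshold_for_fn_rate(df2['ngram similarity'], df2['ngram_correct'])`.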