Question

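I am trying to build a search-based result in which I have an input dataframe with a single row that I want to compare against another dataframe with almost one million rows. I am using a package called Record Linkage.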

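However, I am not able to handle typos. Suppose my original data contains "HSBC" and the user types "HKSBC"; I want to return only the "HSBC" results. Comparing string similarity with Jaro-Winkler, I get the following result: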

from pyjarowinkler import distance
distance.get_jaro_distance("hksbc", "hsbc", winkler=True, scaling=0.1)
>> 0.94

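However, I cannot get "HSBC" back as the output from this alone, so I want to create a new column in my pandas dataframe that holds the string similarity score, and then keep only the rows whose score is above a particular threshold.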
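To make the score-then-threshold idea concrete, here is a minimal pure-Python sketch of Jaro-Winkler scoring over a list of names. The function names (`jaro`, `jaro_winkler`, `best_matches`) and the sample names are hypothetical, chosen only for illustration; this is not the pyjarowinkler implementation, and it applies the prefix bonus unconditionally, which some libraries do not.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted by the length of the common prefix (up to max_prefix)."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * scaling * (1 - score)


def best_matches(query, names, threshold=0.9):
    """Return (name, score) pairs at or above the threshold, best first."""
    q = query.strip().lower()
    scored = [(name, jaro_winkler(q, name.strip().lower())) for name in names]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data: only "HSBC" should survive the 0.9 threshold.
print(best_matches("HKSBC", ["HSBC", "Barclays", "Citibank"]))
```

With a pandas dataframe, the same function could in principle be applied to a column, for example with `df['Name'].map(...)`, although scoring one million rows this way in pure Python would likely be too slow without first narrowing down the candidates.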

Also, the main bottleneck is that I have almost one million rows, so I need the computation to be really fast.

P.S. I do not intend to use fuzzywuzzy; I would prefer either Jaccard or Jaro-Winkler.

P.P.S. Any other ideas for handling typos in search-based input are also welcome.

Answer 1

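I was only able to solve it with record linkage. Basically, the package does an initial indexing step and generates candidate links (see the documentation on "SortedNeighbourhood indexing" for more information); that is, it builds a MultiIndex over the two dataframes that need to be compared. I did that step manually.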

Here is my code:

import recordlinkage

df['index'] = 1 # this will be static since I'll have only one input value
df['index_2'] = range(1, len(df)+1)

df.set_index(['index', 'index_2'], inplace=True)

candidate_links=df.index

df.reset_index(drop=True, inplace=True)
df.index = range(1, len(df)+1)

# once the candidate links has been generated you need to reset the index and compare with the input dataframe which basically has only one static index, i.e. 1

compare_cl = recordlinkage.Compare()
compare_cl.string('Name', 'Name', label='Name', method='jarowinkler') # 'Name' is the column name which is there in both the dataframe

features = compare_cl.compute(candidate_links,df_input,df) # df_input is the i/p df having only one index value since it will always have only one row
print(features)
                       Name
index   index_2 
  1      13446       0.494444
         13447       0.420833
         13469       0.517949

Now I can apply a filter like this:

features = features[features['Name'] > 0.9] # setting the threshold which will filter away my not-so-close names.

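And then:

# df was re-indexed 1..len(df) above, so its index lines up with the 'index_2' level of features
df = df[df.index.isin(features.index.get_level_values('index_2'))]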

This filters my results and gives me the final dataframe, containing only the names whose score is above a threshold set by the user.
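Since the question also mentions Jaccard and the cost of scoring close to a million rows, here is a rough alternative sketch that is not part of the recordlinkage code above. It builds an inverted index from character bigrams to row positions, so a query is scored only against names that share at least one bigram with it. The helper names (`bigrams`, `build_index`, `jaccard`, `search`), the sample names, and the 0.3 threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def bigrams(text):
    """Set of character bigrams of the lower-cased, stripped text."""
    t = text.strip().lower()
    if len(t) < 2:
        return {t} if t else set()
    return {t[i:i + 2] for i in range(len(t) - 1)}


def build_index(names):
    """Map each bigram to the positions of the names that contain it."""
    index = defaultdict(set)
    for pos, name in enumerate(names):
        for gram in bigrams(name):
            index[gram].add(pos)
    return index


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def search(query, names, index, threshold=0.3):
    """Score only the names sharing a bigram with the query; best first."""
    q = bigrams(query)
    candidates = set()
    for gram in q:
        candidates |= index.get(gram, set())
    scored = [(names[pos], jaccard(q, bigrams(names[pos]))) for pos in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data; the index is built once and reused per query.
names = ["HSBC", "Barclays", "Citibank", "HSBC Holdings"]
index = build_index(names)
print(search("HKSBC", names, index))
```

The index is built once up front, and each lookup then touches only the rows that share a bigram with the query. How much this helps on real data depends on how common the query's bigrams are, so the threshold and the n-gram size would need tuning.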
