我有一个城市名称列表,其中一些错误拼写:
['bercelona', 'emstrdam', 'Praga']
列出所有可能的城市名称的列表:
['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
我正在寻找能够找到第一个和第二个列表名称之间最接近匹配的算法,并返回带有拼写清晰名称的第一个列表。所以它应该返回以下列表:
['Barcelona', 'Amsterdam', 'Prague']
答案 0 :(得分:7)
您可以使用内置的Ratcliff和Obershelp算法:
def is_similar(first, second, ratio):
return difflib.SequenceMatcher(None, first, second).ratio() > ratio
first = ['bercelona', 'emstrdam', 'Praga']
second = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
result = [s for f in first for s in second if is_similar(f,s, 0.7)]
print result
['Barcelona', 'Amsterdam', 'Prague']
其中0.7是相似系数。它可能会对您的案例进行一些测试并设置此值。它显示了两个字符串的相似程度(1 - 它是相同的字符串,0 - 非常不同的字符串)
答案 1 :(得分:3)
首先,您应该使用字符串之间的Levenshtein距离,我找到了以下Levenshtein Distance Algorithm for Python的链接:
# Define Levenshtein distance function (from the mentioned link)
def levenshtein(s1, s2):
if len(s1) < len(s2):
return levenshtein(s2, s1)
if len(s2) == 0:
return len(s1)
previous_row = range(len(s2) + 1)
for i, c1 in enumerate(s1):
current_row = [i + 1]
for j, c2 in enumerate(s2):
insertions = previous_row[j + 1] + 1
deletions = current_row[j] + 1
substitutions = previous_row[j] + (c1 != c2)
current_row.append(min(insertions, deletions, substitutions))
previous_row = current_row
return previous_row[-1]
一旦你得到了这个,你应该使一个函数能够找到给定字符串与列表拼写清楚的列表之间最接近的匹配。
names_list = ['bercelona', 'emstrdam', 'Praga']
good_names = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
# Define a function that returns the best match
def get_closest_match(name, real_names):
levdist = [levenshtein(name, real_name) for real_name in real_names]
for i in range(len(levdist)):
if levdist[i] == min(levdist):
return real_names[i]
# Loops the first list
final_list = [ get_closest_match(name, good_names) for name in names_list ]
最后你只需用这个函数循环第一个列表。结果如下:
>>> print final_list
['Barcelona', 'Amsterdam', 'Prague']
答案 2 :(得分:3)
这可能是名为fuzzywuzzy的优秀软件包的一个很好的用例。
from fuzzywuzzy import fuzz
import numpy as np
bad = ['bercelona', 'emstrdam', 'Praga']
good = ['New York', 'Amsterdam', 'Barcelona', 'Berlin', 'Prague']
# you can even set custom threshold and only return matches if above certain
# matching threshold
def correctspell(word, spellcorrect, thresh = 70):
mtchs = map(lambda x: fuzz.ratio(x, word) if fuzz.ratio(x, word) > thresh else None, spellcorrect)
max = np.max(mtchs)
if max is not None:
return spellcorrect[mtchs.index(max)]
else:
return None
# get correct spelling
map(lambda x: correctspell(x, good, thresh = 70), bad) # ['Barcelona', 'Amsterdam', 'Prague']