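I have a list of about 500 items. I want to replace every fuzzy-matching item in that list with the shortest item among its matches.
Is there a way to speed up my fuzzy-matching implementation?
Note: I posted a similar question before, but since it got no responses I am rewriting it.
My implementation: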
from collections import defaultdict
from fuzzywuzzy import fuzz

def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    """
    #list1 = list(ds1.Title)
    #list2 = list(ds1.Title)
    """
    matchdict = defaultdict(list)
    for i, u in enumerate(list1):
        for i1, u1 in enumerate(list2):
            # Since list orders are the same, this makes sure this isn't the same item.
            if i != i1:
                if fuzz.partial_token_sort_ratio(u, u1) >= cutoff:
                    # Because there are potential duplicates, I have to make the key constant.
                    # Otherwise, putting list1 as the key will result in both duplicate items
                    # serving as the key.
                    # Potential problem:
                    #   what if there are different shortstr?
                    # A stable sort by length keeps the two strings distinct when
                    # they have equal length (min/max would both return the first).
                    shortstr, longstr = sorted((u, u1), key=len)
                    matchdict[shortstr].append(longstr)
    return matchdict
Answer 0 (score: 2)
I assume you have already installed python-Levenshtein, which should give you roughly a 4x speedup.
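If you are not sure whether it is installed, a quick check along these lines may help. This is a minimal sketch: `has_module` is a hypothetical helper name, and `Levenshtein` is the import name that the python-Levenshtein package provides. Without it, fuzzywuzzy falls back to the slower pure-Python `difflib` matcher.

```python
import importlib.util


def has_module(name):
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(name) is not None


if has_module("Levenshtein"):
    print("python-Levenshtein is available")
else:
    print("python-Levenshtein not found; fuzzywuzzy will use its difflib fallback")
```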
Optimizing the loop and the dictionary access:
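import itertools
from fuzzywuzzy import fuzz

def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    # Both lists are assumed to hold the same items in the same order, so
    # permutations of the indices yields every (i1, i2) pair with i1 != i2.
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        u1 = list1[i1]
        u2 = list2[i2]
        if fuzz.partial_token_sort_ratio(u1, u2) >= cutoff:
            # A stable sort by length keeps the two strings distinct on ties.
            shortstr, longstr = sorted((u1, u2), key=len)
            # setdefault creates the list the first time a key is seen.
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict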
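One further tweak, offered as a sketch rather than as part of the code above: `itertools.permutations` visits every pair twice, once as (a, b) and once as (b, a). Since the key is always the shorter string, `itertools.combinations` should give the same keys with half the scorer calls and without appending each match twice, provided the scorer treats both argument orders alike. The `score` function below is a stand-in built on `difflib` so the example runs without third-party packages. It is an assumption, not fuzzywuzzy's scorer, and `find_matches` is a hypothetical name.

```python
import itertools
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in scorer (assumption): 0-100 similarity from the standard library.
    return 100 * SequenceMatcher(None, a, b).ratio()


def find_matches(items, cutoff=90, scorer=score):
    matchdict = {}
    # combinations() yields each unordered index pair exactly once.
    for i1, i2 in itertools.combinations(range(len(items)), 2):
        u1, u2 = items[i1], items[i2]
        if scorer(u1, u2) >= cutoff:
            # Stable sort by length: shorter string first, ties keep their order.
            shortstr, longstr = sorted((u1, u2), key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict


titles = ["the quick brown fox", "the quick brown foxes", "something else"]
print(find_matches(titles))
```

For 500 items this is 124,750 scorer calls instead of 249,500.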
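Apart from the fuzzy call itself, this is about as fast as it will get. If you read the source code, you will see that some preprocessing is applied to each string on every iteration. We can do all of that once, up front: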
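import sys
from fuzzywuzzy.string_processing import StringProcessor

# These three names mirror the ones defined in fuzzywuzzy's utils module.
PY3 = sys.version_info[0] == 3
bad_chars = "".join(chr(i) for i in range(128, 256))
translation_table = dict((ord(c), None) for c in bad_chars)

def _asciionly(s):
    # Drop the characters in the 128-255 range.
    if PY3:
        return s.translate(translation_table)
    else:
        return s.translate(None, bad_chars)

def full_pre_process(s, force_ascii=True):
    if force_ascii:
        s = _asciionly(s)
    # Keep only letters and numbers (see Unicode docs).
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
    # Force into lowercase.
    string_out = StringProcessor.to_lower_case(string_out)
    # Remove leading and trailing whitespace.
    string_out = StringProcessor.strip(string_out)
    # Sort the whitespace-separated tokens (not the individual characters).
    return " ".join(sorted(string_out.split()))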
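To see what this kind of preprocessing does to a title, here is a rough standard-library approximation. It is only a sketch: `token_sort_key` is a hypothetical name, and the `\W` regex is an assumption standing in for fuzzywuzzy's own `StringProcessor`, so edge cases may differ.

```python
import re


def token_sort_key(s):
    # Replace every non-word character with a space, then lowercase.
    cleaned = re.sub(r"\W", " ", s).lower()
    # split() with no arguments also discards leading, trailing and repeated
    # whitespace, so no separate strip step is needed.
    return " ".join(sorted(cleaned.split()))


titles = ["New York Mets!", "mets, new york"]
keys = [token_sort_key(t) for t in titles]
print(keys)
```

Both titles reduce to the same key, which is why sorting tokens once up front lets a plain ratio comparison do the rest.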
import itertools
from fuzzywuzzy import fuzz

def find_fuzzymatch_samelist(list1, list2, cutoff=90):
    matchdict = dict()
    # Preprocess every string once, into new lists, so the original strings
    # are still available for the result.
    clean1 = [full_pre_process(each) for each in list1]
    if list1 is list2:
        # Comparing a list to itself: reuse the work.
        clean2 = clean1
    else:
        clean2 = [full_pre_process(each) for each in list2]
    # Both lists are assumed to have the same length and order.
    for i1, i2 in itertools.permutations(range(len(list1)), 2):
        # The strings are already processed and token-sorted, so the plain
        # partial_ratio is enough here.
        if fuzz.partial_ratio(clean1[i1], clean2[i2]) >= cutoff:
            # Record the original strings; a stable sort keeps them distinct on ties.
            shortstr, longstr = sorted((list1[i1], list2[i2]), key=len)
            matchdict.setdefault(shortstr, []).append(longstr)
    return matchdict
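The remaining step from the question, replacing each item with its shortest match, could then look roughly like the sketch below. The helper names `build_replacements` and `replace_with_shortest` are hypothetical, and the rule for a string that matches several different short strings (take the shortest one) is an assumption about what is wanted.

```python
def build_replacements(matchdict):
    # Map each longer string to the shortest key it was matched with.
    replacements = {}
    for shortstr, longstrs in matchdict.items():
        for longstr in longstrs:
            best = replacements.get(longstr)
            if best is None or len(shortstr) < len(best):
                replacements[longstr] = shortstr
    return replacements


def replace_with_shortest(items, matchdict):
    replacements = build_replacements(matchdict)
    result = []
    for item in items:
        # Follow chains (c -> b -> a); `seen` guards against cycles.
        seen = set()
        while item in replacements and item not in seen:
            seen.add(item)
            item = replacements[item]
        result.append(item)
    return result


matchdict = {"abc": ["abcd", "abcde"], "abcd": ["abcde"]}
items = ["abc", "abcd", "abcde", "xyz"]
print(replace_with_shortest(items, matchdict))
```

Items with no match are left unchanged.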