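For loops are very expensive
I am building a correction algorithm based on Peter Norvig's spelling corrector code. I modified it a little and found that running it over thousands of words takes a very long time.
The original algorithm checks candidates at edit distance 1 and 2 and picks a correction. I extended it to distance 3, which may be what increased the time (I'm not sure). Here is the final part of the code, where the most frequent word is chosen as the correction: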
def correct(word):
    candidates = (known([word]).union(known(edits1(word)))).union(known_edits2(word).union(known_edits3(word)) or [word])  # this is where the problem is
    candidate_new = []
    for candidate in candidates:  # this statement isnt the problem
        if soundex(candidate) == soundex(word):
            candidate_new.append(candidate)
    return max(candidate_new, key=(NWORDS.get))
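It looks like the statement for candidate in candidates is what increases the execution time. You can easily look at Peter Norvig's code by clicking here.
I have found where the problem is. It is in this statement: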
candidates = (known([word]).union(known(edits1(word)))
).union(known_edits2(word).union(known_edits3(word)) or [word])
where
def known_edits3(word):
    return set(e3 for e1 in edits1(word) for e2 in edits1(e1)
               for e3 in edits1(e2) if e3 in NWORDS)
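As you can see, known_edits3 contains 3 nested for loops, which makes execution take 3 times as long. known_edits2 has 2 for loops. So this is the culprit.
How can I minimize this expression? Could itertools.repeat help here?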
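The slowdown from the nested loops is likely far more than a factor of 3, because each extra level calls edits1 on every string produced by the level before it. The sketch below counts how many strings the loops visit at each depth. It assumes an edits1 written in the style of Norvig's corrector over the 26 lowercase letters; generated_count is a hypothetical helper introduced only for this illustration.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: plain lowercase English letters


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def generated_count(word, depth):
    """Hypothetical helper: number of strings the nested loops visit at this depth.

    Duplicates are counted, because a nested generator expression
    visits every string before the surrounding set() removes repeats.
    """
    if depth == 1:
        return len(edits1(word))
    return sum(generated_count(e, depth - 1) for e in edits1(word))


if __name__ == "__main__":
    # Depth 3 is left out on purpose: it takes noticeably longer to count.
    for depth in (1, 2):
        print(depth, generated_count("cat", depth))
```

Every string at one depth spawns well over a hundred strings at the next, so the work grows multiplicatively with depth rather than linearly. That is why adding known_edits3 dominates the running time.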
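Answer 0 (score: 2)
A few ways to improve performance: compute soundex(word) once, outside the loop, and build the filtered candidates with a list comprehension or a generator expression.
The code then reduces to: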
def correct(word):
    candidates = (known([word]).union(known(edits1(word)))).union(known_edits2(word).union(known_edits3(word)) or [word])
    # Compute soundex outside the loop
    soundex_word = soundex(word)
    # List comprehension
    candidate_new = [candidate for candidate in candidates if soundex(candidate) == soundex_word]
    # Or a generator (use one of the two). This will save memory
    candidate_new = (candidate for candidate in candidates if soundex(candidate) == soundex_word)
    return max(candidate_new, key=(NWORDS.get))
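One caveat with the max call above: if no candidate shares the word's soundex code, max raises ValueError on the empty sequence. A minimal sketch of one way to guard against that uses the default argument of max (Python 3.4+). The NWORDS counts, the crude soundex stand-in, and the pick_best name are all assumptions made so the example runs on its own; they are not the real implementations.

```python
# Hypothetical word counts; the real NWORDS is built from a text corpus.
NWORDS = {"spelling": 10, "spieling": 2, "selling": 7}


def soundex(word):
    """Crude stand-in, NOT real Soundex: first letter plus the
    remaining consonants, with immediate repeats collapsed."""
    key = word[0]
    for ch in word[1:]:
        if ch not in "aeiouyhw" and ch != key[-1]:
            key += ch
    return key


def pick_best(word, candidates):
    """Return the most frequent candidate that sounds like word, or word itself."""
    target = soundex(word)  # computed once, outside the generator
    matches = (c for c in candidates if soundex(c) == target)
    # default=word avoids ValueError when nothing matches
    return max(matches, key=lambda c: NWORDS.get(c, 0), default=word)


if __name__ == "__main__":
    print(pick_best("speling", ["spelling", "spieling", "selling"]))
    print(pick_best("zzz", ["spelling"]))
```

Falling back to the input word mirrors the or [word] fallback in the candidates expression; returning None instead would be an equally reasonable choice.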
Another improvement relies on the fact that you only need the single best (MAX) candidate:
def correct(word):
    candidates = (known([word]).union(known(edits1(word)))).union(known_edits2(word).union(known_edits3(word)) or [word])
    soundex_word = soundex(word)
    max_candidate = None
    max_nword = 0
    for candidate in candidates:
        count = NWORDS.get(candidate, 0)
        if soundex(candidate) == soundex_word and count > max_nword:
            max_candidate = candidate
            max_nword = count  # track the best count seen so far
    return max_candidate
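Both versions above still build the full candidates set, which the question identifies as the expensive part. One possible way to cut that cost, sketched here under assumptions, is to try the edit distances in order and stop at the first one that yields known words, similar to the or chain in Norvig's original correct. The NWORDS counts and the tiered_candidates name are hypothetical, and this is a different technique from the union used in the question: closer matches hide farther ones.

```python
import string

ALPHABET = string.ascii_lowercase
# Hypothetical word counts; the real NWORDS is built from a text corpus.
NWORDS = {"spelling": 10, "selling": 7, "cat": 5, "cart": 3}


def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Keep only the words that appear in NWORDS."""
    return {w for w in words if w in NWORDS}


def known_edits2(word):
    """Known words exactly two edits1 steps away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS}


def tiered_candidates(word):
    """Try distance 0, then 1, then 2; 'or' stops at the first non-empty set,
    so the costlier tiers only run when the cheaper ones find nothing."""
    return known([word]) or known(edits1(word)) or known_edits2(word) or {word}


if __name__ == "__main__":
    print(tiered_candidates("speling"))
    print(tiered_candidates("spling"))
```

A distance-3 tier could be appended as one more or known_edits3(word) term before the fallback, so it would only run for words with no closer match. The soundex filter from the answer would then be applied to this smaller set, which may change which correction wins.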