Question

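I am trying to write a function that finds the closest spelling of a (possibly misspelled) word with the same first letter, using different n-grams and distance measures.

What I have so far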

from nltk.corpus import words
from nltk import ngrams
from nltk.metrics.distance import edit_distance, jaccard_distance

first_letters = ['A','B','C']
spellings = words.words()

def recommendation(word):
    # n means 'n'-grams, here I use 3 as an example
    n = 3
    spellings_new = [w for w in spellings if (w[0] in first_letters)]
    # ______ is the distance measure
    dists = [________(set(ngrams(word, n)), set(ngrams(w, n))) for w in spellings_new]
    return spellings_new[dists.index(min(dists))]

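The rest seems straightforward, but I don't know how to specify the "same first letter" condition. In particular, if the misspelled word starts with the letter 'A', then the corrected word recommended from '.words', the one with the minimum distance to the misspelled word, should also start with 'A', and so on. As you can see from the function block above, I used '(w[0] in first_letters)' as my "first letter condition", but this doesn't do the trick and always returns words that start with a different first letter. I haven't found a similar thread on this board that addresses my problem, and I would appreciate it if anyone could show me how to specify the "first letter condition". If this question has been asked before in some form and is considered inappropriate, I will remove it.

Thank you.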

Answer 1

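You're really close. w[0] == word[0] can be used to check whether the first letter is the same. After that, set(w) and set(word) can be used to turn the words into sets of letters. I then passed those sets to jaccard_distance, just because that's what you already had imported. There may be a better solution for this.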
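The idea above can be sketched roughly as follows. This is a minimal illustration, not the answerer's exact code: so that it runs without downloading the NLTK corpus, it uses a small hypothetical word list (sample_words) in place of words.words(), and two hand-written helpers (char_ngrams, jaccard_distance) that are assumed stand-ins for nltk's ngrams and nltk.metrics.distance.jaccard_distance. It also compares character n-grams, as in the question, rather than plain sets of letters.

```python
def char_ngrams(word, n):
    """Return the set of character n-grams of `word`, each as a tuple."""
    return set(zip(*(word[i:] for i in range(n))))


def jaccard_distance(a, b):
    """Jaccard distance between two sets: (|a ∪ b| - |a ∩ b|) / |a ∪ b|."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (distance 0).
        return 0.0
    return (len(union) - len(a & b)) / len(union)


def recommendation(word, spellings, n=3):
    """Return the candidate with the same first letter and the smallest distance."""
    # The "same first letter" condition: compare against the input word itself,
    # not against a fixed list of letters.
    candidates = [w for w in spellings if w and w[0] == word[0]]
    if not candidates:
        return None
    target = char_ngrams(word, n)
    return min(candidates, key=lambda w: jaccard_distance(target, char_ngrams(w, n)))


# Hypothetical word list standing in for words.words()
sample_words = ["cormorant", "corpulent", "dormant", "incident", "indecent", "validate"]

print(recommendation("cormulent", sample_words))  # prints "corpulent"
```

With the real corpus, spellings = words.words() could presumably be passed in the same way, and edit_distance could be swapped in by comparing the words directly instead of their n-gram sets.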

Find the closest spelling with the same first letter using different distance measures
