Question

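I have to perform a spell-check-like operation in Python, as follows:

I have a huge list of words (let's call it the lexicon). I am then given some text (let's call it the sample). I have to look up each sample word in the lexicon. If I cannot find it, that sample word is an error.

In short: a brute-force spell checker. However, searching the lexicon linearly for each sample word is bound to be slow. What is a better way to do this?

The complicating factor is that neither the sample nor the lexicon is in English. They are in a language that, instead of 26 characters, can have over 300, stored as Unicode.

Any suggestion of an algorithm, data structure, or parallelization method would be helpful. Algorithms that gain speed at the cost of less than 100% accuracy would be perfect, since I don't need 100% accuracy. I know about Norvig's algorithm, but it seems to be English-specific.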

Answer 1

You can use a set of Unicode strings:

s = set([u"rabbit", u"lamb", u"calf"])

and use the in operator to check whether a word occurs:

>>> u"rabbit" in s
True
>>> u"wolf" in s
False

This lookup is essentially O(1), so the size of the dictionary does not matter.

Edit: Here is the complete code for a (case-sensitive) spell checker (Python 2.6 or later):

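from io import open  # io.open accepts an encoding argument on Python 2.6+
import re
# Load the lexicon, one word per line, into a set of unicode strings.
with open("dictionary", encoding="utf-8") as f:
    words = set(line.strip() for line in f)
with open("document", encoding="utf-8") as f:
    # re.UNICODE makes \w match non-ASCII letters on Python 2.
    for w in re.findall(r"\w+", f.read(), re.UNICODE):
        if w not in words:
            print "Misspelled:", w.encode("utf-8")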

(The print assumes that your terminal uses UTF-8.)

Answer 2

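Use a tree structure to store the words, so that each path from the root to a leaf represents a single word. If your traversal cannot reach a leaf, or reaches a leaf before the end of the word, the word is not in your lexicon.

Apart from the benefits Emil mentions in the comments, note that this also lets you do things like backtracking to find alternative spellings.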
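A minimal sketch of such a tree (a trie) might look like the following. The names `Trie`, `insert`, `contains` and `near_matches` are illustrative and not from the answer. The sketch marks word ends with an explicit flag rather than relying on leaves, so that a word which is a prefix of another word is still handled, and `near_matches` is only one possible form of backtracking, assuming that same-length words with a few substituted characters are the alternatives of interest.

```python
class Trie(object):
    """Illustrative trie node; one node per character position."""

    def __init__(self):
        self.children = {}    # maps a single character to a child node
        self.is_word = False  # True if a lexicon word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:  # the traversal fell off the tree
                return False
        return node.is_word

    def near_matches(self, word, budget=1):
        """Same-length lexicon words differing in at most `budget` characters."""
        results = []

        def walk(node, i, prefix, left):
            if i == len(word):
                if node.is_word:
                    results.append(prefix)
                return
            for ch, child in node.children.items():
                cost = 0 if ch == word[i] else 1
                if cost <= left:  # otherwise abandon this branch (backtrack)
                    walk(child, i + 1, prefix + ch, left - cost)

        walk(self, 0, u"", budget)
        return results


# Hypothetical three-word lexicon, reusing the words from Answer 1.
lexicon = Trie()
for w in [u"rabbit", u"lamb", u"calf"]:
    lexicon.insert(w)

suggestions = lexicon.near_matches(u"lomb")
```

Because the children are kept in a plain dict keyed by character, the size of the alphabet (26 or 300+) does not change the code.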

Answer 3

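Try using a set, like everyone is telling you. Set lookups were optimized in Python's C code by experienced programmers, so there is no way you can do better in your small application.

Unicode is not a problem: set and dictionary keys can be Unicode or English text, it doesn't matter. The only consideration for you might be Unicode normalization, since different orderings of diacritics will not compare equal. If this is an issue for your language, I would first make sure the lexicon is stored in normalized form, and then normalize each word before checking it. For example, unicodedata.normalize('NFC', word)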
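As a rough illustration of the normalization point, the sketch below builds one string in two ways that differ only in the order of two combining marks. The helper name `nfc` and the sample strings are made up for the example.

```python
import unicodedata


def nfc(word):
    # Hypothetical helper: canonical composition (NFC) of a single word.
    return unicodedata.normalize("NFC", word)


# "a" with a dot below (U+0323) and an acute accent (U+0301),
# typed with the two combining marks in opposite orders.
stored = u"a\u0323\u0301"
typed = u"a\u0301\u0323"

# Normalize every lexicon entry once, when the set is built.
lexicon = set(nfc(w) for w in [stored])

raw_hit = typed in lexicon              # compares raw code points
normalized_hit = nfc(typed) in lexicon  # compares normalized forms
```

Here `raw_hit` is False and `normalized_hit` is True: the raw strings contain different code point sequences, while their NFC forms are identical.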

Answer 4

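This is where sets come in. Create a set of all the words in your dictionary, then use the membership operator to check whether a word is present in it.

Here is a simplified example: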

>>> dictionary = {'Python','check-like', 'will', 'perform','follows:', 'spelling', 'operation'}
>>> for word in "I will have to perform a spelling check-like operation in Python as follows:".split():
    if word in dictionary:
        print "Found {0} in the dictionary".format(word)
    else:
        print "{0} not present in the dictionary".format(word)


I not present in the dictionary
Found will in the dictionary
have not present in the dictionary
to not present in the dictionary
Found perform in the dictionary
a not present in the dictionary
Found spelling in the dictionary
Found check-like in the dictionary
Found operation in the dictionary
in not present in the dictionary
Found Python in the dictionary
as not present in the dictionary
Found follows: in the dictionary
>>>

Answer 5

The average time complexity of a hashed lookup in a Python dictionary is O(1). You can therefore use a "dictionary with no values" (a.k.a. a set).

Answer 6

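This is what Python dictionaries and sets are for! :) Store your lexicon in a dictionary if each word has some value attached (say, a frequency), or in a set if you only need to check for existence. Lookups in both are O(1), so it will be very fast.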

lex = set(('word1', 'word2', .....))

for w in words:
    if w not in lex:
        print "Error: %s" % w
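For the variant where each word carries a value, a minimal sketch could use `collections.Counter`, which is a dictionary subclass. The word list and the names `corpus_words`, `freq` and `find_errors` are made up for illustration.

```python
from collections import Counter

# Hypothetical corpus; in practice the words would come from a file.
corpus_words = [u"lamb", u"calf", u"lamb", u"rabbit", u"lamb"]

# Counter maps each distinct word to the number of times it occurs.
freq = Counter(corpus_words)


def find_errors(words):
    # `in` on a dictionary tests its keys, with the same hashed lookup as a set.
    return [w for w in words if w not in freq]


errors = find_errors([u"lamb", u"wolf"])
lamb_count = freq[u"lamb"]
```

The membership test is the same as with a set, and the stored counts remain available if they are needed later, for example to rank candidate corrections.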

Answer 7

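First, you need to create an index of your lexicon. You could build your own indexing system, but a better way is to use a full-text search engine. I would recommend Apache Lucene or Sphinx. Both are fast and open source. After that, you can send search queries from Python to the search engine and read the replies.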

Answer 8

Here is a post I wrote about checking this kind of thing. It works in a similar way to Google's suggest / spell checker.

http://blog.mattalcock.com/2012/12/5/python-spell-checker/

Hope it helps.

Fastest lexicon matching