Searching a hash table for a spell checker in Python

Date: 2018-04-03 19:10:08

Tags: python python-3.x

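I have the following code, which tries to spell check words using a hash table. It compares the file to be spell checked against a file that serves as the dictionary, and should return every misspelled word. I have already done this successfully with binary and linear search, but I am finding this approach more challenging.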

The dictionary file has one word per line, and the test file is a large body of text.

import re


dictionary = {}
document = []
with open('dict.txt') as f:
    for word in f:
        dictionary[word] = 1


with open("testfile.txt" , encoding="utf8") as f:
    content = f.read().split(" ")
    content = [item.lower() for item in content]
    content = ' '.join(content)
    content = re.findall("[a-z]+", content)
    for line in content:
           document.append(line)



for line in document:
    for word in line:
        if word not in dictionary:
            print("{} on line #{} is spelled wrong!".format(word, document.index(line)))

The code's output is:

t on line #0 is spelled wrong!
h on line #0 is spelled wrong!
e on line #0 is spelled wrong!
c on line #1 is spelled wrong!
o on line #1 is spelled wrong!
m on line #1 is spelled wrong!
p on line #1 is spelled wrong!
l on line #1 is spelled wrong!
e on line #1 is spelled wrong!
t on line #1 is spelled wrong!

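It is spelling out the test file letter by letter, as you can see from the way it spells "complete". It seems to treat every letter in the test file as a misspelled word.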

1 answer:

Answer 0 (score: 1)

This is your problem:

for word in line:

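Iterating over a string yields characters, not words. Naming the variable word doesn't change that, because Python doesn't look at variable names to work out what you want. word is still a single character.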
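A minimal sketch of the difference, using a made-up sample string (the variable names here are only for illustration):

```python
line = "the complete"

# Iterating over a string directly yields one-character strings.
chars = [word for word in line]

# Splitting on whitespace first yields whole words.
words = [word for word in line.split()]

print(chars[:3])  # ['t', 'h', 'e']
print(words)      # ['the', 'complete']
```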

You want something more like:

for word in line.split():

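...although this will include any punctuation attached to the words. In practice, you probably want a regular expression that matches runs of one or more word characters within a line: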

import re

# same as before, up to:
for word in re.findall(r"\w+", line):
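Putting it together, here is one possible sketch of the whole checker, not necessarily what the answer's author had in mind. The helper names load_dictionary and find_misspelled and the in-memory sample data are assumptions made for illustration. It also addresses two further issues that appear to be present in the question's code: lines read from a file keep their trailing newline, so they need stripping before being stored as keys, and document.index(line) returns the position of the first equal element rather than the current line number.

```python
import re


def load_dictionary(lines):
    """Build a set of known words from an iterable of lines, one word per line."""
    # strip() removes the trailing newline; without it "the\n" is stored, not "the".
    return {line.strip().lower() for line in lines if line.strip()}


def find_misspelled(lines, dictionary):
    """Return (word, line_number) pairs for words not found in the dictionary."""
    misspelled = []
    # enumerate() tracks the real line number, unlike list.index(),
    # which returns the position of the first equal element.
    for line_number, line in enumerate(lines, start=1):
        for word in re.findall(r"[a-z]+", line.lower()):
            if word not in dictionary:
                misspelled.append((word, line_number))
    return misspelled


# Hypothetical in-memory data standing in for dict.txt and testfile.txt.
dictionary = load_dictionary(["the\n", "complete\n", "works\n"])
document = ["The complete works\n", "of Shakspeare\n"]

for word, line_number in find_misspelled(document, dictionary):
    print("{} on line #{} is spelled wrong!".format(word, line_number))
```

Both functions accept any iterable of lines, so an open file object can be passed in directly, for example load_dictionary(f) inside with open('dict.txt') as f. A set is used instead of a dict with dummy values; both are hash tables, so the membership test is the same kind of lookup.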