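I have the following code, which tries to spell-check words using a hash table. It compares the file to be spell-checked against a file used as a dictionary and returns all misspelled words. I have already done this task successfully with binary search and linear search, but I am finding this approach more challenging.
The dictionary has one word per line, and the test file is a large body of text.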
import re
dictionary = {}
document = []
with open('dict.txt') as f:
    for word in f:
        dictionary[word] = 1
with open("testfile.txt" , encoding="utf8") as f:
    content = f.read().split(" ")
    content = [item.lower() for item in content]
    content = ' '.join(content)
    content = re.findall("[a-z]+", content)
    for line in content:
        document.append(line)
for line in document:
    for word in line:
        if word not in dictionary:
            print("{} on line #{} is spelled wrong!".format(word, document.index(line)))
The code outputs:
t on line #0 is spelled wrong!
h on line #0 is spelled wrong!
e on line #0 is spelled wrong!
c on line #1 is spelled wrong!
o on line #1 is spelled wrong!
m on line #1 is spelled wrong!
p on line #1 is spelled wrong!
l on line #1 is spelled wrong!
e on line #1 is spelled wrong!
t on line #1 is spelled wrong!
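It is spell-checking the test file letter by letter; as you can see, the output is spelling out "complete". It seems to treat every letter in the test file as a misspelled word.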
Answer 0 (score: 1)
Here is your problem:
for word in line:
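Iterating over a string yields characters, not words. Naming the variable word does not change that, because Python does not look at variable names to figure out what you want; word is still a single character.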
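As a small illustration of the point above, the sketch below iterates over a sample string twice: once directly and once after calling split(). The sample text and the names chars and words are made up for this example.

```python
line = "the complete"  # hypothetical sample text

# Iterating over the string itself yields one character at a time.
chars = [ch for ch in line]

# Splitting on whitespace first yields whole words.
words = [w for w in line.split()]

print(chars[:3])  # ['t', 'h', 'e']
print(words)      # ['the', 'complete']
```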
You want something more like:
for word in line.split():
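...though this will include punctuation that sits right next to a word. In practice, you probably want a regular expression that matches each run of one or more word characters in a line: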
import re
# same as before, up to:
for word in re.findall(r"\w+", line):
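Putting the pieces together, here is a minimal sketch of how the whole check might look. It is one possible arrangement, not the only one. The helper names load_dictionary and find_misspelled and the sample data are hypothetical. Two details go beyond the fix above and are my own additions: iterating over a file yields lines that still end in a newline, so the dictionary entries are stripped before being stored, and line numbers come from enumerate rather than document.index, which only returns the position of the first matching element. A set is used instead of a dict with dummy values; both are hash-based, so the lookup cost is similar.

```python
import re


def load_dictionary(lines):
    """Build a set of known words, stripping whitespace and newlines."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_misspelled(lines, dictionary):
    """Return (line_number, word) pairs for words missing from the dictionary."""
    misspelled = []
    for line_number, line in enumerate(lines, start=1):
        # \w+ matches each run of word characters, so adjacent punctuation is dropped.
        for word in re.findall(r"\w+", line.lower()):
            if word not in dictionary:
                misspelled.append((line_number, word))
    return misspelled


# Hypothetical in-memory data standing in for dict.txt and testfile.txt.
dictionary = load_dictionary(["the\n", "complete\n", "works\n"])
text = ["The complete works,\n", "of Shakspeare.\n"]

result = find_misspelled(text, dictionary)
for line_number, word in result:
    print("{} on line #{} is spelled wrong!".format(word, line_number))
```

Both helpers accept any iterable of lines, so an open file object (for example the f from with open('dict.txt') as f) can be passed in place of the sample lists.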