Consider the following .txt file, myfile.txt:
Box-No.: DK10-95794
Total Discounts USD 1,360.80
Totat: usp 529.20
As you can see, the text file above contains two errors: totat and usp (they should be total and usd).
I am currently using SymSpellPy, a Python package built on SymSpell. It can check a word and determine whether it is spelled correctly.
Here is my Python script:
import os
import re

from symspellpy import SymSpell, Verbosity

# maximum edit distance per dictionary precalculation
max_edit_distance_dictionary = 2
prefix_length = 7
# create object
sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
# load dictionary
dictionary_path = os.path.join(
    os.path.dirname(__file__), "Dictionaries/eng.dictionary.txt")
term_index = 0  # column of the term in the dictionary text file
count_index = 1  # column of the term frequency in the dictionary text file
if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
    print("Dictionary file not found")

with open("myfile.txt", "r") as file:
    for line in file:
        for word in re.findall(r'\w+', line):
            # word by word
            input_term = word
            # max edit distance per lookup
            max_edit_distance_lookup = 2
            suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
            suggestions = sym_spell.lookup(input_term, suggestion_verbosity,
                                           max_edit_distance_lookup)
            # display the input term and its suggested replacement
            for suggestion in suggestions:
                word = word.replace(input_term, suggestion.term)
                print("{}, {}".format(input_term, word))
Running the script above on my text file gives the following output:
Total, Total
USD, USD
Totat, Total
As you can see, it correctly caught the last word: totat => total.
My question is: how can I find the misspelled words and correct them in the txt file?
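
For reference, here is a minimal sketch of the kind of in-place correction I am after. It is not a SymSpell solution: the suggest function and the VOCAB list are hypothetical stand-ins built on the standard-library difflib module so that the snippet runs on its own. In the real script, suggest would presumably wrap sym_spell.lookup instead. The idea is to pass a callback to re.sub so that only the words are replaced and the punctuation, numbers, and spacing of each line stay untouched.

```python
import difflib
import re

# Hypothetical stand-in vocabulary; the real script would rely on the
# SymSpell dictionary instead of this short list.
VOCAB = ["box", "no", "total", "discounts", "usd"]


def suggest(word):
    """Return the closest vocabulary term for a lowercase word, or None."""
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.6)
    return matches[0] if matches else None


def fix_word(match):
    """re.sub callback: return the corrected word, keeping the original casing."""
    word = match.group(0)
    lower = word.lower()
    if lower in VOCAB:
        return word  # already spelled correctly
    best = suggest(lower)
    if best is None:
        return word  # no suggestion, leave the word alone
    if word.isupper():
        return best.upper()
    if word[0].isupper():
        return best.capitalize()
    return best


def correct_text(text):
    # Purely alphabetic tokens only, so codes like DK10 and numbers are skipped.
    return re.sub(r"\b[A-Za-z]+\b", fix_word, text)


def correct_file(path):
    """Read the file, correct its words, and write the result back."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(correct_text(text))
```

Calling correct_file("myfile.txt") would overwrite the file with the corrected text, so writing to a separate output file first is probably safer while experimenting.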