Question

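I have searched Stack Overflow and other sources and cannot find a solution. I am trying to create a Tesseract font (which I have done successfully). However, I want recognition to be restricted to, or at least strongly biased toward, words in a dictionary. I have a freq_words list containing the words I want the output corrected to. For example, my output contains the word "tor", and my dictionary contains the word "for". No matter how I add the dictionary or a lang.user_words file, the output does not change. I would like to know whether I am missing something in this process.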

This is my process:

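Create all the required font/language files (inttemp, normproto, pffmtable, shapetable, etc.).
Use wordlist2dawg to convert the word list (freq_words) into lang.freq-dawg.
Combine these files with combine_tessdata, add the result to my tessdata directory, and load it into Tesseract.
Run tesseract with my new language.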
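
For reference, the dictionary and packing steps above can be sketched as shell commands. This is a minimal sketch, not a confirmed fix: the file names come from the question, while the `run` helper, the `RUN` flag, and the `check.` prefix are hypothetical names introduced here. It assumes the Tesseract 3.x training tools are on the PATH and that the unicharset passed to wordlist2dawg is the same one used for training.

```shell
#!/bin/sh
# Sketch of the dictionary and packing steps. LANG_CODE and WORDLIST use the
# names from the question; run, RUN, and the "check." prefix are hypothetical.
LANG_CODE=eng1
WORDLIST=freq_words

# Print each command by default; set RUN=1 to actually execute it.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Compile the plain-text word list into a DAWG against the training unicharset.
run wordlist2dawg "$WORDLIST" "$LANG_CODE.freq-dawg" "$LANG_CODE.unicharset"

# Pack every component file whose name starts with "eng1." into eng1.traineddata.
run combine_tessdata "$LANG_CODE."

# Unpack the result under a different prefix; if the DAWG was packed,
# a check.freq-dawg file should appear among the extracted components.
run combine_tessdata -u "$LANG_CODE.traineddata" check.
```

Unpacking the finished traineddata is one way to confirm that the freq-dawg component actually made it into the file that Tesseract loads, as opposed to an older copy still sitting in the tessdata directory.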

My problem is that no matter what I change the freq-words file to, my output does not seem to change. I also tried adding "for" to lang.user-words (in tessdata), but again nothing changed.

Language files:

eng1.freq-dawg (created from freq_words text file)
eng1.inttemp
eng1.pffmtable
eng1.normproto
eng1.shapetable
eng1.unicharset
eng1.[fontname].exp0.[box][tif][tr][txt]
font_properties
eng1.traineddata (output file that is put in the tessdata path)

Config file:

load_system_dawg     T
load_freq_dawg       T
user_words_suffix    user-words
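
With `user_words_suffix user-words`, the user word list is expected to be a plain-text file, one word per line, named after the language plus the suffix (here `eng1.user-words`) inside the tessdata directory. The sketch below writes such a file; `write_user_words` is a hypothetical helper, not part of Tesseract, and the naming rule is stated as an assumption about Tesseract 3.x behavior.

```shell
#!/bin/sh
# write_user_words is a hypothetical helper, not part of Tesseract.
# Usage: write_user_words TESSDATA_DIR LANG WORD...
write_user_words() {
    dir=$1
    lang=$2
    shift 2
    # Plain text, one word per line; the file name is LANG + "." + suffix,
    # assuming user_words_suffix is set to "user-words" as in the config.
    printf '%s\n' "$@" > "$dir/$lang.user-words"
}

# Example with the paths from the question (may need write permission):
# write_user_words /opt/local/share/tessdata eng1 for
```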

Terminal command:

tesseract eng1.fontname.exp0.tif output -l eng1 /opt/local/share/tessdata/configs/config.txt

If anyone has any suggestions, please let me know.

Tesseract OCR dictionary matching

0 answers: