Question

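I have an image that I can't get tesseract to recognize as text. All of my input texts are URLs.

As you can see, the image is about as clear as it can be.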

Running tesseract test2.png stdout returns http:II11111111111111111111111111111111111 1111111111111111111.coml, which is close but not correct.

When I set the tessedit_char_whitelist parameter to htp:/1.com, it recognizes the string correctly (but I would like to recognize web addresses more generally).
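One way to pass that parameter is the -c VAR=VALUE option of the tesseract command line. The Python sketch below only builds and prints the command; the helper name whitelist_command is made up for illustration, and actually running the command assumes tesseract is installed and on the PATH.

```python
import shlex


def whitelist_command(image, chars):
    """Build a tesseract command that restricts output to `chars`.

    `whitelist_command` is a hypothetical helper, not part of tesseract.
    """
    return ["tesseract", image, "stdout",
            "-c", f"tessedit_char_whitelist={chars}"]


cmd = whitelist_command("test2.png", "htp:/1.com")
print(shlex.join(cmd))
# To execute it (requires tesseract to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

A whitelist this narrow only works because the expected characters are known in advance, which is why it does not generalize to arbitrary URLs.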

I also tried the command line

tesseract test2.png stdout --user-patterns ./patterns.txt

with the following patterns file:

\n\*://\n\*
http://\n\*
\n\*.com

This does not help the recognition. It still prefers I over /. (More details about the pattern file)

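I also tried setting the parameter ok_repeated_ch_non_alphanum_wds to include / (and chs_trailing_punct{1,2} for the trailing /), but it does not seem to work. Specifying http:// as a "word" in --user-words did not help either.

Is there any way to specify string priorities for tesseract?

Version information: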

Answer 1

You can achieve this by adding the following line to the unicharambigs file:

3 : I I 3 : / / 1

Output using the modified traineddata file:

$ tesseract test2.png stdout
http://11111111111111111111111111111111111
1111111111111111111.coml
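For readers unfamiliar with the unicharambigs line, its fields are tab-separated in the actual file: the number of source characters, the source characters separated by spaces, the number of replacement characters, the replacement characters, and a type flag where 1 marks a mandatory substitution. The rule therefore says to always replace ": I I" with ": / /". Below is a minimal Python sketch that formats such a line, assuming the v1 unicharambigs layout; ambig_line is a hypothetical helper name, not a tesseract API.

```python
def ambig_line(source, target, mandatory=True):
    """Format one unicharambigs rule (v1 layout) as a tab-separated line.

    source, target: lists of single characters (unichars).
    mandatory: True writes type 1 (always substitute), False writes type 0.
    `ambig_line` is a hypothetical helper for illustration only.
    """
    fields = [
        str(len(source)), " ".join(source),
        str(len(target)), " ".join(target),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


rule = ambig_line([":", "I", "I"], [":", "/", "/"])
# Show the rule with spaces instead of tabs so it is easy to compare
print(rule.replace("\t", " "))
```

The answer does not say how the traineddata file was modified. With the standard training tools this is usually done by unpacking with combine_tessdata -u, editing the .unicharambigs component, and writing it back with combine_tessdata -o, so treat those steps as an assumption rather than the answerer's exact procedure.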