Question

I am using Tesseract 3.05.01 on Windows to extract text from an image that contains a few lines. Each line is enclosed in a rounded rectangle. [Image attached for reference]

Tesseract detects the rounded rectangle as "C" at the start of a line and as ">" at the end of a line.

This is what Tesseract returns:

The Richter scale is used for measuring the
magnitude of which natural phenomenon?

C Earthquake >
C Hurricane >
C Tsunami

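I tried adding ">" to the blacklist, but the blacklisted symbol just gets replaced by a similar-looking one. So I think an option to extract only characters of similar size would let me skip the shapes.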

Is there a way to detect only lines with a similar font size/height? Or can you suggest any other way to solve this?

Answer 1

You can use a whitelist instead of a blacklist, one that includes every character you want to keep. For example, in tesseract.js it is:

tessedit_char_whitelist: "abcdefghijklmnop ...."
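
As a rough sketch of how the full whitelist might be put together for text like the sample above: `buildWhitelist` is a hypothetical helper (not part of tesseract.js), and the punctuation set is only a guess at what the quiz text needs. The resulting object is what would be handed to tesseract.js, for instance via `worker.setParameters(...)` in versions that use workers.

```javascript
// Hypothetical helper: collects the characters we expect in the quiz text.
// The punctuation set is an assumption; adjust it to your own images.
function buildWhitelist() {
  const lower = "abcdefghijklmnopqrstuvwxyz";
  const upper = lower.toUpperCase();
  const digits = "0123456789";
  const punctuation = " ?.,'-";
  return lower + upper + digits + punctuation;
}

// Parameter object in the shape shown in the answer above.
const params = { tessedit_char_whitelist: buildWhitelist() };

// With a tesseract.js worker this would be applied roughly as:
//   await worker.setParameters(params);
console.log(params.tessedit_char_whitelist);
```

Note that ">" is left out, but "C" has to stay in the whitelist because it is an ordinary letter, so the whitelist on its own may not remove the leading "C".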