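The tesseract package in R does not recognize any characters

Time: 2017-01-10 21:55:57

Tags: r web-scraping imagemagick tesseract

I am using R, version 3.3.2. I am trying to parse some text with the new tesseract package. The image looks like this: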

Image

The code is simple:

library(tesseract)
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789abcdefghijklmnopqrstuvwxyz"))
text <- ocr("some_image_path.png", engine = engine)

The result is:

Too few characters. Skipping this page

Why does it not recognize any characters?

1 answer:

Answer 0: (score: 1)

Because there are "Too few characters".

There seems to be a limit:
const int kMinCharactersToTry = 50;

That limit is tested for, and the function returns early when the test fails:

// If there are too few characters, skip this page entirely.
  if (real_max < kMinCharactersToTry / 2) {
    tprintf("Too few characters. Skipping this page\n");
    return 0;
  }

Try again with a sample that contains at least 25 characters?
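
To make the threshold concrete, here is a minimal C++ sketch of the comparison quoted above. It is not Tesseract's actual function: `would_skip_page` and `candidate_count` are hypothetical names introduced for illustration, and `candidate_count` merely stands in for `real_max`, whose computation is not shown in the quoted snippet.

```cpp
// Sketch only: mirrors the quoted threshold check, not Tesseract's real code.
constexpr int kMinCharactersToTry = 50;  // value quoted in the answer

// Hypothetical helper: returns true when the page would be skipped.
// `candidate_count` is an assumed stand-in for `real_max` in the snippet.
constexpr bool would_skip_page(int candidate_count) {
  // Integer division: 50 / 2 == 25, so any count below 25 is skipped.
  return candidate_count < kMinCharactersToTry / 2;
}
```

Under these assumptions, a count of 24 triggers the skip, while 25 is the smallest count that passes the check.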