Question

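In my application, I am reading text from an image that contains digits and letters separated by '-', for example 1-TT88TY5-AD5G.

However, Tesseract ignores the '-' and gives me 1TT88TY5AD5G.

How can I force it to read the hyphen?

Here is my initialization code: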

Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];

Answer 1

I'm guessing here because I have never used Tesseract, but shouldn't the '-' be in the whitelist?

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
                              ^

Answer 2

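Tesseract will not recognize exactly what you need out of the box. You have to test it over a longer period and then apply some pattern matching based on how it actually performs.

Check what it returns in place of the '-'. It is usually best to take whatever Tesseract returns instead of the '-' and replace it with '-' yourself.

In your case the '-' is simply dropped, which is to be expected because your whitelist string does not contain it. Add it to the whitelist: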

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
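
The pattern-matching step mentioned above could look roughly like the following. This is only a sketch in plain C++ (callable from Objective-C++), and it rests on an assumption that is not stated in the question: that every code has the same fixed layout as the example, i.e. groups of 1, 7 and 4 characters. The helper name restoreHyphens is hypothetical and not part of Tesseract.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: re-insert '-' into OCR output that lost its hyphens.
// Assumption: all codes share the same fixed group lengths (e.g. 1-7-4).
// Returns an empty string when the input length does not match the layout,
// so the caller can treat that result as "could not repair".
std::string restoreHyphens(const std::string& raw,
                           const std::vector<std::size_t>& groups) {
    std::size_t total = 0;
    for (std::size_t g : groups) total += g;
    if (raw.size() != total) return "";

    std::string out;
    std::size_t pos = 0;
    for (std::size_t i = 0; i < groups.size(); ++i) {
        if (i > 0) out += '-';              // separator between groups
        out += raw.substr(pos, groups[i]);  // copy the next group
        pos += groups[i];
    }
    return out;
}
```

For the string from the question, restoreHyphens("1TT88TY5AD5G", {1, 7, 4}) rebuilds 1-TT88TY5-AD5G. If the layout of your codes varies, this approach does not apply and the whitelist fix is the one to rely on.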

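You can use the following methods from the Tesseract C++ API to see how confident it is in its output. Note that they report a mean confidence and per-word confidences, not per-character values: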

  /** Returns the (average) confidence value between 0 and 100. */
  int MeanTextConf();
  /**
   * Returns all word confidences (between 0 and 100) in an array, terminated
   * by -1.  The calling function must delete [] after use.
   * The number of confidences should correspond to the number of space-
   * delimited words in GetUTF8Text.
   */
  int* AllWordConfidences();
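
As a rough illustration of how the array described above might be consumed, the sketch below walks a -1-terminated confidence array and collects the indices of words that fall under a threshold. The function name lowConfidenceWords and the threshold are assumptions made for this example; whether the Objective-C wrapper used in the question exposes these calls is not confirmed here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: return the indices of words whose confidence is
// below `threshold`. `confidences` must follow the documented format:
// values between 0 and 100, terminated by -1.
std::vector<std::size_t> lowConfidenceWords(const int* confidences,
                                            int threshold) {
    std::vector<std::size_t> flagged;
    if (confidences == nullptr) return flagged;  // nothing to inspect
    for (std::size_t i = 0; confidences[i] != -1; ++i) {
        if (confidences[i] < threshold) flagged.push_back(i);
    }
    return flagged;
}
```

With the C++ API you would pass the pointer returned by AllWordConfidences() and then release it with delete [], as the header comment requires. Words flagged this way are good candidates for the manual pattern matching described earlier.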
