Question

I am using pytesser to OCR a small image and get a string from it:

from pytesser import *
image = Image.open(ImagePath)
text = image_to_string(image)
print text

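However, pytesser sometimes recognizes and returns non-ASCII characters. The problem shows up when I try to print what was just recognized: in Python 2.7 (which is what I am using), the program crashes.

Is there some way to make pytesser not return any non-ASCII characters? Perhaps there is something that can be changed in Tesseract OCR itself?

Alternatively, is there some way to test a string for non-ASCII characters (without crashing the program) and then simply not print that line?

Some would suggest using Python 3.4, but from my research it looks like pytesser does not work with it: Pytesser in Python 3.4: name 'image_to_string' is not defined?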

Answer 1

I would go with Unidecode. The library converts non-ASCII characters to their most similar ASCII representation.

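from unidecode import unidecode
image = Image.open(ImagePath)
text = image_to_string(image)
# Tesseract writes UTF-8 bytes; decode first, because unidecode expects unicode text
print unidecode(text.decode('utf-8'))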

The result is then plain ASCII, so printing it should no longer fail.
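
If dropping characters is acceptable instead of transliterating them, the second idea from the question (test a string for non-ASCII characters and skip that line) can be sketched with the standard library alone. This is only a minimal sketch, assuming the OCR result is either a UTF-8 byte string or already-decoded text. The helper names to_text, is_ascii, ascii_lines and strip_non_ascii are made up for illustration and are not part of pytesser or Unidecode.

```python
def to_text(raw, encoding="utf-8"):
    """Decode OCR output if it arrived as bytes (assumed UTF-8); pass text through."""
    if isinstance(raw, bytes):
        # Undecodable bytes become U+FFFD, which the ASCII checks below reject.
        return raw.decode(encoding, "replace")
    return raw


def is_ascii(text):
    """Return True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def ascii_lines(raw):
    """Keep only the lines that consist purely of ASCII characters."""
    return [line for line in to_text(raw).splitlines() if is_ascii(line)]


def strip_non_ascii(raw):
    """Drop every non-ASCII character but keep the rest of each line."""
    return to_text(raw).encode("ascii", "ignore").decode("ascii")


# Stand-in for OCR output: the first line contains a UTF-8 encoded e-acute.
sample = b"caf\xc3\xa9 123\nplain line\n"

for line in ascii_lines(sample):
    print(line)
```

ascii_lines skips a whole line as soon as it contains one non-ASCII character, while strip_non_ascii keeps the line and removes only the offending characters. Which of the two is preferable depends on how much a partially recognized line is worth to you.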

Answer 2

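Is there some way to make pytesser not return any non-ASCII characters?

You can use the tessedit_char_whitelist option to restrict the characters that tesseract will recognize.

For example: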

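import string
# Allow only digits, ASCII letters and the characters "_", "-" and "."
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
# The config string is handed to tesseract as command-line options
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text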

See also: https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old#how-do-i-recognize-only-digits
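
The whitelist string can also be assembled in a small helper, so that the same configuration is reused across calls. This is a minimal sketch, assuming your image_to_string accepts a config string as in the example above. build_whitelist_config is a hypothetical name; only the tessedit_char_whitelist variable comes from Tesseract itself.

```python
import string


def build_whitelist_config(extra="_-."):
    """Build a config string allowing only digits, ASCII letters and `extra`."""
    allowed = string.digits + string.ascii_letters + extra
    # A space inside the value would be read as the start of a new
    # command-line argument, so reject whitespace early.
    if any(ch.isspace() for ch in allowed):
        raise ValueError("whitelist characters must not contain whitespace")
    return "-c tessedit_char_whitelist=%s" % allowed


config = build_whitelist_config()
print(config)
```

The resulting string would then be passed as the config argument, for example image_to_string(image, config=config).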

Image to text - remove non-ASCII characters in Python 2.7
