Question

Has anyone managed to get digits-only recognition working with the latest tesseract 4.0 from Python?

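Digits-only mode works for me in 3.05 but still returns non-digit characters in 4.0. I tried removing every config file except the digits file, and it still did not work. Any help would be great.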

im is an image of a date, black text on a white background.

Answer 1

You can pass the allowed digits through tessedit_char_whitelist as a config option.

# boxes=False is dropped here: current pytesseract releases no longer accept it.
ocr_result = pytesseract.image_to_string(
    image, lang='eng',
    config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
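If the image holds a date, you may want to allow a separator as well as the digits. A minimal sketch of building the config string, assuming the same flags as above; the helper name digits_config and the extra "/" character are only illustrative:

```python
def digits_config(psm=10, oem=3, whitelist="0123456789"):
    """Build a Tesseract config string that restricts output to `whitelist`."""
    return f"--psm {psm} --oem {oem} -c tessedit_char_whitelist={whitelist}"


# Digits only, identical to the string used in the call above.
print(digits_config())

# Hypothetical variant for dates: digits plus a slash separator.
print(digits_config(whitelist="0123456789/"))
```

The resulting string would then be passed as config=digits_config(...) to pytesseract.image_to_string.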

Hope this helps.

Answer 2

As you can see in this GitHub issue, the blacklist and whitelist do not work in tesseract 4.0.
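Since the behaviour depends on the installed version, it can help to check it before relying on the whitelist. A rough sketch, assuming only what this thread reports (3.05 works, 4.0 does not, later releases do); the function name whitelist_supported is hypothetical:

```python
import re


def whitelist_supported(version: str) -> bool:
    """Return False for Tesseract 4.0.x, where the whitelist is reported to be ignored."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse a version from {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Only the 4.0 series is treated as affected; this is an assumption
    # based on the reports in this thread, not an exhaustive list.
    return (major, minor) != (4, 0)


for v in ("3.05.01", "4.0.0-beta.1", "4.1.1"):
    print(v, whitelist_supported(v))
```

With pytesseract installed, str(pytesseract.get_tesseract_version()) should give you a version string to feed in.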

As I describe in this blog article, there are 3 possible solutions:

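1. Update tesseract to version 4.1 or later.
2. Use the legacy mode, as described in the answer by @thewaywewere.
3. Create a Python function that uses a simple regular expression to extract all the digits: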

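import re
import pytesseract

def replace_chars(text):
    # Keep only the runs of digits found in the OCR output and join them.
    list_of_numbers = re.findall(r'\d+', text)
    result_number = ''.join(list_of_numbers)
    return result_number

# im is the image from the question.
result_number = replace_chars(pytesseract.image_to_string(im))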
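To see what the regular-expression approach does without running OCR, here is a small self-contained sketch. The sample string is made up, standing in for whatever tesseract returns for a date image, and the helper names are only illustrative:

```python
import re


def extract_digit_groups(text):
    """Return each run of digits in `text`, in order of appearance."""
    return re.findall(r"\d+", text)


def extract_digits(text):
    """Join all digit runs into one string, dropping every other character."""
    return "".join(extract_digit_groups(text))


sample = "Date: 12/03/2019\n"  # made-up OCR output, not a real result
print(extract_digit_groups(sample))  # ['12', '03', '2019']
print(extract_digits(sample))        # 12032019
```

Keeping the groups separate may be more useful for dates, since joining them loses the day/month/year boundaries.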

Answer 3

You can specify the digits in tessedit_char_whitelist as a config option, as shown below.

ocr_result = pytesseract.image_to_string(image, lang='eng', config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')

Answer 4

Using the tessedit_char_whitelist flag with pytesseract did not work for me. However, one workaround is to use a flag that does work, namely config='digits':

import pytesseract
text = pytesseract.image_to_string(pixels, config='digits')

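where pixels is a NumPy array of your image (a PIL image should also work). This should force pytesseract to return only digits. Now, to customize what it returns, find your digits config file at: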

C:\Program Files (x86)\Tesseract-OCR\tessdata\configs

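Open the digits file and add whatever characters you need. After you save it and run pytesseract again, it should return only those custom characters.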
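For reference, a sketch of what such a config file could contain. This assumes the usual Tesseract config-file layout of one "variable value" pair per line, and that the digits file sets tessedit_char_whitelist; please check against the file on your own install. The file name digits_and_slash and the temporary directory are hypothetical, used here so that nothing in the real tessdata folder is touched:

```python
import tempfile
from pathlib import Path


def render_whitelist_config(allowed_chars="0123456789"):
    """Return config-file text with a single 'variable value' line."""
    return f"tessedit_char_whitelist {allowed_chars}\n"


# Write the file to a throwaway directory just to show the result.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "digits_and_slash"  # hypothetical file name
    config_path.write_text(render_whitelist_config("0123456789/"), encoding="ascii")
    written = config_path.read_text(encoding="ascii")

print(written, end="")
```

Editing the existing digits file by hand, as described above, amounts to changing the character list on that one line.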

Pytesseract using tesseract 4.0 digits only not working
