Question

I am using the following code in Python to extract text from an image:

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on Disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)

    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write image after removed noise
    cv2.imwrite(src_path + "removed_noise.png", img)

    #  Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    # Write the image after apply opencv to do some ...

    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

    # Remove template file
    #os.remove(temp)

    return result


print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))

print("------ Done -------")

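But the output is incorrect. The input file is:

The output received is '0001' instead of 'D001'.

The output received is '3001' instead of 'B001'.

What code changes are required to retrieve the correct characters from the image, and also to train pytesseract to return the correct characters for all font types in the image [including bold characters]?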

Answer 1

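@Maaaaa has pointed out the exact reason for Tesseract's incorrect text recognition.

But you can still improve the final output by applying some post-processing steps to the Tesseract output. Here are a few points you can consider; use them if they help:

Try disabling the dictionary-check feature via Tesseract's input parameters.
Use heuristic information from your dataset. From the given sample images, I guess that the first character of every word/sequence is a letter, so you can replace a leading digit in the output with the most probable letter based on your dataset; e.g. '0' can be replaced with 'D', so '0001' -> 'D001', and similarly for other cases.
Tesseract also provides character-level recognition confidence values, so use that information to replace a character with the candidate that has the highest confidence value.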
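
The dataset-based heuristic from the second point could be sketched roughly as follows. This is only an illustration: the helper name `fix_leading_char` and the table `DIGIT_TO_LETTER` are made up here, and the table contains just the two confusions reported in the question ('0' for 'D', '3' for 'B'). It also assumes that every sequence in your data really does start with a letter, which you should verify first.

```python
# Hypothetical lookup table: digits that Tesseract returned in place of a
# visually similar letter. Only the two pairs reported in the question are
# listed; extend it based on your own dataset.
DIGIT_TO_LETTER = {"0": "D", "3": "B"}


def fix_leading_char(text, table=DIGIT_TO_LETTER):
    """Replace a leading digit with its look-alike letter, if one is known.

    Assumes each recognized sequence is expected to start with a letter.
    """
    text = text.strip()  # Tesseract output often carries trailing newlines
    if text and text[0] in table:
        return table[text[0]] + text[1:]
    return text


# Example: apply the fix to raw OCR output
# cleaned = fix_leading_char(get_string(src_path + "test.jpg"))
```

Only the first character is touched, so the digits in the rest of the sequence stay as Tesseract read them.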

Answer 2

Try different config parameters in the line below:

result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

like this:

result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')

Try changing the psm value and compare the results.
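
One way to do that comparison is a small loop over several page segmentation modes. The sketch below is only a suggestion: `compare_psm` is a made-up helper name, the default list of psm values is an arbitrary selection of modes meant for short text, and the `ocr` argument exists only so the loop can be tried out without Tesseract installed.

```python
def compare_psm(image, psm_values=(6, 7, 8, 10, 13), ocr=None):
    """Run OCR once per page segmentation mode and collect the results.

    Returns a dict mapping each psm value to the stripped recognized text.
    """
    if ocr is None:
        # Imported lazily so the helper itself does not require Tesseract
        # when a custom `ocr` callable is supplied.
        import pytesseract
        ocr = pytesseract.image_to_string

    results = {}
    for psm in psm_values:
        config = "--psm {} --oem 3".format(psm)
        results[psm] = ocr(image, config=config).strip()
    return results


# Example usage (assumes Pillow, pytesseract and the Tesseract binary are installed):
# from PIL import Image
# for psm, text in compare_psm(Image.open(src_path + "test.jpg")).items():
#     print(psm, repr(text))
```

For a short code such as 'D001', the single-line and single-word modes (7 and 8) are probably the first ones worth checking.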

-Good luck-

Python - Pytesseract extracts incorrect text from image
