How can I improve the accuracy of pytesseract?

Date: 2021-07-24 17:17:50

Tags: python opencv image-processing ocr python-tesseract

A few days ago I started an OCR project. The input is a very noisy gray image with white letters. Using the EAST text detector I can find the text and draw bounding boxes around it. I then crop each rectangle and do some image processing, after which I pass the processed region to pytesseract, but the results are poor. The images and the source video are below. Maybe someone has a good idea for better image processing and/or better pytesseract settings.

Images (not reproduced here): input image, rectangles after recognition, first part, second part, third part.

Tesseract result: AY U N74 O54

Source code for the image processing:

    import cv2
    import numpy as np
    import pytesseract
    from PIL import Image

    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (8, 8))
    kernel2 = np.ones((3, 3), np.uint8)
    kernel3 = np.ones((5, 5), np.uint8)
    # `cropped` is the rectangle cut out after EAST text detection
    gray = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, None, fx=7, fy=7)
    gray = cv2.GaussianBlur(gray, (5, 5), 1)
    # gray = cv2.medianBlur(gray, 5)
    gray = cv2.dilate(gray, kernel3, iterations=1)
    gray = cv2.erode(gray, kernel3, iterations=1)
    gray = cv2.morphologyEx(gray, cv2.MORPH_DILATE, kernel3)
    gray = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    gray = cv2.bitwise_not(gray)
    ts_img = Image.fromarray(gray)
    # Each Tesseract config variable needs its own -c flag; the whitelist
    # must also include 0 and 9, which occur in the expected text
    txt = pytesseract.image_to_string(
        ts_img,
        config='--oem 3 --psm 12 '
               '-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ '
               '-c load_system_dawg=false -c load_freq_dawg=false')

I tried some other psm settings, such as psm 11, psm 8, and psm 6. The results differ, but all are poor. I think the biggest problem is the black specks touching the letters and digits, but I don't know how to remove them. I appreciate any help :)

1 Answer:

Answer 0 (score: 1)

OCR software performs poorly when it tries to interpret this text as words or sentences, because it expects real English words rather than arbitrary character combinations. I suggest analyzing the text as individual characters instead. I solved the (example) problem by first determining which groups of marked pixels (connected components of the thresholded image) are characters, based on the size and position of each group. Then, for each image section containing a (single) character, I used easyocr to read the character. I found that pytesseract performs poorly or not at all on single characters (even with --psm 10 and other settings). The code below produces this result:

OCR out: 6UAE005X0721295

(Figure: OCR on individual characters)

import cv2
import matplotlib.pyplot as plt
import numpy as np
import easyocr
reader = easyocr.Reader(["en"])

# Threshold image and determine connected components
img_bgr = cv2.imread("C5U3m.png")
img_gray = cv2.cvtColor(img_bgr[35:115, 30:], cv2.COLOR_BGR2GRAY)
ret, img_bin = cv2.threshold(img_gray, 195, 255, cv2.THRESH_BINARY_INV)
retval, labels = cv2.connectedComponents(255 - img_bin, connectivity=8)
fig, axs = plt.subplots(4)
axs[0].imshow(img_gray, cmap="gray")
axs[0].set_title("grayscale")
axs[1].imshow(img_bin, cmap="gray")
axs[1].set_title("thresholded")
axs[2].imshow(labels, vmin=0, vmax=retval - 1, cmap="tab20b")
axs[2].set_title("connected components")

# Find and process individual characters
OCR_out = ""
all_img_chars = np.zeros((labels.shape[0], 0), dtype=np.uint8)
labels_xmin = [np.argwhere(labels == i)[:, 1].min() for i in range(0, retval)]
# Process the labels (connected components) from left to right
for i in np.argsort(labels_xmin):
    label_yx = np.argwhere(labels == i)
    label_ymin = label_yx[:, 0].min()
    label_ymax = label_yx[:, 0].max()
    label_xmin = label_yx[:, 1].min()
    label_xmax = label_yx[:, 1].max()
    # Characters are large blobs that don't border the top/bottom edge
    if label_yx.shape[0] > 250 and label_ymin > 0 and label_ymax < labels.shape[0]:
        img_char = img_bin[:, label_xmin - 3 : label_xmax + 3]
        all_img_chars = np.hstack((all_img_chars, img_char))
        # Use EasyOCR on single char (pytesseract performs poorly on single characters)
        OCR_out += reader.recognize(img_char, detail=0)[0]
axs[3].imshow(all_img_chars, cmap="gray")
axs[3].set_title("individual characters")
fig.show()

print("Truth:   6UAE005X0721295")
print("OCR out: " + OCR_out)