Question

I am trying to extract text from the image below using Tesseract:

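Tesseract's output is: etiocsat” If I manually remove the tick marks (highlighted in yellow) by editing the image, Tesseract returns the correct text. How can I remove the highlighted parts using OpenCV in Python?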

Answer 1

You can filter the symbols directly in Tesseract by restricting recognition to a character whitelist:

import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('image.png'), lang='eng', config='-c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')

Output:

CTLDCBGT
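
As a small sketch of the same idea, the whitelist can be assembled from the standard `string` module instead of being typed by hand. The helper name `build_whitelist_config` is hypothetical and not part of pytesseract; it only builds the config string used above.

```python
import string


def build_whitelist_config(allowed=string.ascii_uppercase + string.digits):
    """Return a Tesseract config string that restricts output to `allowed`."""
    return "-c tessedit_char_whitelist=" + allowed


# Same whitelist as in the answer: uppercase letters followed by digits.
config = build_whitelist_config()
print(config)
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string`, exactly as in the snippet above.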

Answer 2

You can use the OpenCV function findContours() and remove the tick marks based on their area.

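First convert the image to a binary image, then invert it so the text is white on black, and then remove the smaller contours based on their area. The following code snippet implements this: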

import cv2
import pytesseract

im = cv2.imread("4SPb7.png")
# BGR to grayscale conversion (cv2.imread loads images in BGR order)
im_gray = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

# grayscale to binary
_, im_bw = cv2.threshold(im_gray, 0, 255, cv2.THRESH_OTSU + cv2.THRESH_BINARY)

# invert image
im_bw = 255-im_bw

# find contours; [-2:] works on OpenCV 3.x (three return values) and 4.x (two)
cnts, hierarchy = cv2.findContours(im_bw, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2:]

# remove small components (the tick marks) by filling contours with area below 30
for i, cnt in enumerate(cnts):
    if cv2.contourArea(cnt) < 30:
        cv2.drawContours(im_bw, cnts, i, 0, cv2.FILLED)

im_bw = 255-im_bw
print(pytesseract.image_to_string(im_bw))
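
To illustrate what the area filter does without needing the image file, here is a minimal pure-Python sketch of the same idea on a tiny 0/1 grid. It is not the OpenCV implementation: `remove_small_components` is a hypothetical helper, blobs are found with a breadth-first search, and area is taken as the pixel count, which is close to but not identical to `cv2.contourArea`.

```python
from collections import deque


def remove_small_components(grid, min_area):
    """Return a copy of a 0/1 grid with 8-connected blobs under min_area cleared."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first search collects every pixel of this blob.
            blob = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Blobs below the threshold play the role of the tick marks.
            if len(blob) < min_area:
                for y, x in blob:
                    out[y][x] = 0
    return out


# A 3x3 "letter" plus two isolated one-pixel "ticks".
image = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
cleaned = remove_small_components(image, min_area=3)
for row in cleaned:
    print(row)
```

As with the threshold of 30 in the answer, `min_area` has to be chosen for the image at hand: large enough to catch the ticks, small enough to keep the thinnest character.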

Output image:
