Question

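I'm having a problem with text recognition on some low-contrast files. I'm using pytesseract, and for some files, like this one, it returns nothing at all: https://github.com/tomcat-slf4j-logback/tomcat-slf4j-logback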

I use the LineBoxBuilder with PyTesseract. Before that, I convert the PDF to a JPG:

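from wand.color import Color
from wand.image import Image as Img

def save_img_with_wand(self, pdfName, output):
    # Render the PDF at 300 DPI and flatten any transparency onto white
    with Img(filename=pdfName, resolution=300) as pic:
        pic.compression_quality = 100
        pic.background_color    = Color("white")
        pic.alpha_channel       = 'remove'
        pic.save(filename=output)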

The LineBox builder:

def line_box_builder(self, img):
    try:
        return self.tool.image_to_string(
            img,
            lang=self.lang,
            builder=pyocr.builders.LineBoxBuilder()
        )

    except pytesseract.pytesseract.TesseractError as t:
        self.Log.error('Tesseract ERROR : ' + str(t))

If nothing is found, I use OpenCV to try to improve detection:

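@staticmethod
def improve_image_detection(img):
    src     = cv2.imread(img, cv2.IMREAD_GRAYSCALE)
    # Gaussian adaptive threshold: 11x11 neighbourhood, constant C = 2
    dst     = cv2.adaptiveThreshold(src, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
    # Note: this overwrites the original image file in place
    cv2.imwrite(img, dst)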
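
For clarity, the "retry after preprocessing" flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ocr_with_fallbacks`, `run_ocr`, and `preprocessors` are hypothetical names, not part of pyocr, pytesseract, or OpenCV. It assumes each preprocessing step writes its result to a new file and returns that path, so every attempt starts from the untouched original rather than from an already thresholded image.

```python
from typing import Callable, Iterable, List


def ocr_with_fallbacks(
    image_path: str,
    run_ocr: Callable[[str], list],
    preprocessors: Iterable[Callable[[str], str]],
) -> List:
    """Run OCR on the original image, then retry on preprocessed copies.

    run_ocr       -- hypothetical callable: image path -> list of line boxes
                     (may return None or an empty list when nothing is found)
    preprocessors -- hypothetical callables: original path -> path of a new,
                     preprocessed image file
    """
    # First attempt: the image as converted from the PDF.
    result = run_ocr(image_path)
    if result:
        return result

    # Fallback attempts: each preprocessor is applied to the ORIGINAL image.
    for preprocess in preprocessors:
        candidate_path = preprocess(image_path)
        result = run_ocr(candidate_path)
        if result:
            return result

    # Nothing was recognised by any attempt.
    return []
```

With the code from this question, `run_ocr` would presumably wrap `line_box_builder` (opening the file first), and the adaptive-threshold step would be one entry in `preprocessors`, changed to write to a separate output file.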

I've tried several OpenCV approaches, but in every case I'm unable to read text on a light background like the one in the image above.

Thanks in advance for your help.

Improve Tesseract detection on grayscale images

0 answers: