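I have a problem with text recognition on certain low-contrast files. I am using pytesseract, and for some files, like this one, it returns nothing at all: https://github.com/tomcat-slf4j-logback/tomcat-slf4j-logback
I use the LineBoxBuilder (pyocr.builders.LineBoxBuilder). Before that, I convert the PDF to JPG: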
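    from wand.color import Color
    from wand.image import Image as Img

    def save_img_with_wand(self, pdfName, output):
        # Render the PDF at 300 DPI
        with Img(filename=pdfName, resolution=300) as pic:
            pic.compression_quality = 100
            # Flatten any transparency onto a white background
            pic.background_color = Color("white")
            pic.alpha_channel = 'remove'
            pic.save(filename=output)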
The LineBoxBuilder call:
    import pyocr.builders
    import pytesseract

    def line_box_builder(self, img):
        try:
            return self.tool.image_to_string(
                img,
                lang=self.lang,
                builder=pyocr.builders.LineBoxBuilder()
            )
        except pytesseract.pytesseract.TesseractError as t:
            # Log the failure instead of raising; the method then returns None
            self.Log.error('Tesseract ERROR : ' + str(t))
If nothing is found, I use OpenCV to try to improve detection:
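    import cv2

    @staticmethod
    def improve_image_detection(img):
        src = cv2.imread(img, cv2.IMREAD_GRAYSCALE)
        # Gaussian adaptive threshold: block size 11, constant C = 2
        dst = cv2.adaptiveThreshold(src, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
        # Overwrite the input file with the binarized image
        cv2.imwrite(img, dst)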
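To make the flow explicit, here is a minimal sketch of how I chain the two steps: OCR first, then preprocess and retry only when nothing comes back. The names `ocr_with_fallback`, `run_ocr` and `preprocess` are hypothetical and only for illustration; they stand in for `line_box_builder` and `improve_image_detection` above and are not part of pyocr, pytesseract or OpenCV.

```python
from typing import Callable, List


def ocr_with_fallback(
    img_path: str,
    run_ocr: Callable[[str], List],
    preprocess: Callable[[str], None],
) -> List:
    """Run OCR once; if nothing is found, preprocess the image and retry.

    run_ocr and preprocess are hypothetical stand-ins for
    line_box_builder and improve_image_detection.
    """
    # First pass on the untouched image. A failed call may return None.
    lines = run_ocr(img_path) or []
    if lines:
        return lines

    # Nothing detected: binarize the image in place, then try once more.
    preprocess(img_path)
    return run_ocr(img_path) or []
```

In my case the second pass still returns an empty list for the low-contrast files.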
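I have tried several OpenCV approaches, but in every case I am unable to read text on a light background like in the file linked above.
Thanks in advance for your help.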