Question

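I am trying to read PDF files similar to the following.

So far I have managed to reach 95% accuracy with the following steps:

Using Wand to convert the PDF to an image (I would prefer pdf2image, but I am on Windows and cannot install poppler).
Splitting each row into the numbers on the left and the words on the right.
Resizing by a factor of 0.85.
Applying a threshold of 185.
Using training data specific to the characters 0-9, '.', ',' and '-' (from here).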
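The row-splitting step is not shown in the question, so the following is only a hypothetical sketch of one way it could be done, not the author's code. It assumes a grayscale page with dark text on a light background, finds text rows with a horizontal projection (any row containing a dark pixel), and cuts each row at a column `split_x` that is assumed to be known in advance. The names `find_text_rows`, `split_row`, `min_height` and `split_x` are made up for illustration.

```python
import numpy as np

def find_text_rows(gray, dark_below=128, min_height=2):
    """Return (top, bottom) pixel ranges of rows that contain dark pixels.

    gray: 2-D uint8 array, dark text on a light background (assumption).
    """
    has_ink = (gray < dark_below).any(axis=1)  # True for each pixel row with text
    rows = []
    start = None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                      # a text row begins
        elif not ink and start is not None:
            if y - start >= min_height:    # ignore specks thinner than min_height
                rows.append((start, y))
            start = None
    # Close a row that runs to the bottom edge of the page
    if start is not None and len(has_ink) - start >= min_height:
        rows.append((start, len(has_ink)))
    return rows

def split_row(gray, top, bottom, split_x):
    """Cut one text row into a left (numbers) and right (words) crop."""
    row = gray[top:bottom]
    return row[:, :split_x], row[:, split_x:]
```

Each left crop returned by `split_row` could then go through the resize and threshold steps before OCR.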

With this process it reads almost every number nearly perfectly, but it sometimes confuses 3s, 5s and 9s.
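To see exactly which characters get confused, one option is to tally mismatches between hand-checked values and the OCR output. This is a minimal sketch under the assumption that a small set of ground-truth strings is available. The function name `confusion_counts` is hypothetical, and pairs of different length are skipped because their alignment is unknown.

```python
from collections import Counter

def confusion_counts(pairs):
    """Count (expected_char, recognized_char) mismatches.

    pairs: iterable of (expected, recognized) strings.
    """
    counts = Counter()
    for expected, recognized in pairs:
        if len(expected) != len(recognized):
            continue  # an insertion or deletion happened; skip rather than guess
        for e, r in zip(expected, recognized):
            if e != r:
                counts[(e, r)] += 1
    return counts
```

Calling `confusion_counts(...).most_common()` lists the most frequent substitutions first, which shows whether the errors really are limited to 3, 5 and 9.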

After all of these steps, the images I get look like this:

The code is:

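import cv2
from wand.image import Image as wi

# `file` is the input PDF path and `filename` the PNG output path (both defined elsewhere)
dir_image = file
pdf = wi(filename=file, resolution=300)
pdfImage = pdf.convert("png")
# Write each rendered page out as a PNG (img is assumed to be each page of the converted PDF)
for img in pdfImage.sequence:
    page = wi(image=img)
    page.save(filename=filename)
# Reload the saved PNG as a grayscale OpenCV image
image = cv2.imread(filename, cv2.IMREAD_GRAYSCALE)
#Split into each row would go here but code is too long and doesn't matter
# Shrink the left (numeric) crop to 85% and binarize it at a fixed threshold of 185
cropped_img_left = cv2.resize(cropped_img_left, None, fx=0.85, fy=0.85, interpolation=cv2.INTER_CUBIC)
ret, cropped_img_left = cv2.threshold(cropped_img_left, 185, 255, cv2.THRESH_BINARY)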
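The fixed threshold of 185 is hand-tuned. One alternative that may be worth trying (it is not part of the original pipeline, and it may or may not help here) is Otsu's method, which picks the threshold from the image histogram by maximizing the between-class variance. The sketch below implements it with NumPy only, assuming an 8-bit grayscale crop. `otsu_threshold` and `binarize` are illustrative names.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t (0-255) that maximizes between-class variance.

    gray: 2-D uint8 array. Pixels <= t form one class, pixels > t the other.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)
    w0 = np.cumsum(hist)               # number of pixels with value <= t
    w1 = hist.sum() - w0               # number of pixels with value > t
    sum0 = np.cumsum(hist * levels)    # intensity sum of the lower class
    sum_all = sum0[-1]
    valid = (w0 > 0) & (w1 > 0)        # both classes must be non-empty
    mean0 = np.zeros(256)
    mean1 = np.zeros(256)
    mean0[valid] = sum0[valid] / w0[valid]
    mean1[valid] = (sum_all - sum0[valid]) / w1[valid]
    between = w0 * w1 * (mean0 - mean1) ** 2
    return int(np.argmax(between))

def binarize(gray, t):
    """Map pixels above t to 255 and the rest to 0."""
    return np.where(gray > t, 255, 0).astype(np.uint8)
```

OpenCV offers the same method through `cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`, which returns the chosen threshold along with the binarized image.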

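I don't know how to reach 100% accuracy. Some ideas I have had are:

Maybe another tool could produce a better-quality PNG from the PDF?
Using other training data that is more similar to my digits.
Changing Tesseract's parameters.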
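As one illustration of the last idea, the sketch below assumes the OCR is run through the `pytesseract` wrapper (the question does not say how Tesseract is invoked). It sets page segmentation mode 7, which treats the image as a single text line, and restricts output to the characters listed above via the `tessedit_char_whitelist` variable. Whether the whitelist is honored depends on the Tesseract version and engine in use, so treat this as something to experiment with. The helper names are hypothetical.

```python
def build_digit_config(psm=7, whitelist="0123456789.,-"):
    """Build a Tesseract config string for a single line of numeric text."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"

def ocr_digits(image, config=None):
    """Run Tesseract on one cropped row and return the stripped text.

    image: a NumPy array or PIL image of the left (numeric) crop.
    """
    # Imported lazily so build_digit_config works without pytesseract installed
    import pytesseract
    return pytesseract.image_to_string(
        image, config=config or build_digit_config()
    ).strip()
```

For example, `ocr_digits(cropped_img_left)` would run the preprocessed left crop through Tesseract with these settings.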

But I am a bit lost, and I would appreciate some guidance.

Many thanks!

Image preprocessing to improve Tesseract

0 answers: