Question

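I am trying to convert a scanned PDF into a readable (searchable) PDF using the code below. First I convert the scanned document to images, then write them back to a blank PDF. This produces output for PDFs without tables, but no images are created for PDFs that contain tables.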

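from pdf2jpg import pdf2jpg
import pytesseract

source = "C://convertpdf//source"
destination = "C://convertpdf//dest"
# Render every page of the scanned PDF to a JPG under the destination folder.
pdf2jpg.convert_pdf2jpg(source, destination, pages="ALL")

# Note: `image` is never defined in this snippet; it has to refer to one of the rendered page images.
# image_to_pdf_or_hocr returns the searchable PDF as bytes.
text = pytesseract.image_to_pdf_or_hocr(image, lang='eng')
target_path = "C://pdfconvert//readblepdf//new.pdf"
# The with block closes the file on exit, so no explicit close() call is needed.
with open(target_path, 'wb') as tmp_pdf:
    tmp_pdf.write(text)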

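I want to take a PDF that contains tables, convert it to images, and then convert those into readable form. Is there another package, or a method in pdf2image, that can do this?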

Answer 1

You can generate a searchable PDF with tesseract (make sure eng.traineddata is in the path).
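The answer's original code did not survive on this page, so the following is only a minimal sketch of one way to do this: calling the tesseract command-line tool with its `pdf` output config. It assumes the `tesseract` binary is on the PATH; the helper names `build_tesseract_cmd` and `image_to_searchable_pdf` are hypothetical and not taken from the answer.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Build the argument list for `tesseract <image> <outputbase> -l <lang> pdf`.

    Hypothetical helper: with the `pdf` config, tesseract appends ".pdf"
    to output_base itself, so output_base should not carry an extension.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def image_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run tesseract on one page image and return the path of the PDF it writes.

    Assumes the tesseract binary is installed and that <lang>.traineddata
    is available in its tessdata directory.
    """
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
    return f"{output_base}.pdf"
```

For example, `image_to_searchable_pdf("page_1.jpg", "page_1")` would be expected to leave the result in `page_1.pdf`, with the recognized text layered invisibly over the page image.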


Answer 2

pdf2jpg.convert_pdf2jpg(source, destination, pages="ALL")

Converts the PDF to images.
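Since the question asks about pdf2image, here is a hedged sketch of the whole pipeline using that package in place of pdf2jpg: render each page to an image, then OCR each image into its own searchable PDF. It assumes pdf2image (with Poppler) and pytesseract (with the Tesseract binary) are installed; `page_pdf_path` and `ocr_scanned_pdf` are hypothetical names, and whether this resolves the problem with table pages is not something the thread confirms.

```python
from pathlib import Path


def page_pdf_path(out_dir, page_number):
    """Return the output path for one OCR'd page, e.g. out/page_0001.pdf."""
    return Path(out_dir) / f"page_{page_number:04d}.pdf"


def ocr_scanned_pdf(pdf_path, out_dir, lang="eng", dpi=300):
    """Render each page of a scanned PDF and write one searchable PDF per page."""
    # Third-party imports are local so page_pdf_path works without them installed.
    from pdf2image import convert_from_path  # needs Poppler
    import pytesseract  # needs the Tesseract binary

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # convert_from_path returns one PIL image per page of the PDF.
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for number, image in enumerate(pages, start=1):
        # extension="pdf" makes pytesseract return the searchable PDF as bytes.
        pdf_bytes = pytesseract.image_to_pdf_or_hocr(image, lang=lang, extension="pdf")
        target = page_pdf_path(out_dir, number)
        target.write_bytes(pdf_bytes)
        written.append(target)
    return written
```

Writing one PDF per page keeps the sketch short. Merging the pages into a single file would need an additional PDF library, which is left out here.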

Converting a scanned PDF to a readable PDF
