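Extracting scanned pages from a PDF using Python

Time: 2018-05-26 15:48:44

Tags: python pdf

I have a lot of PDF files that are basically scanned documents, so every page is a scanned image. I want to run OCR on these files and extract the text. I have tried pytesseract, but it does not perform OCR directly on PDF files. As a workaround, I want to extract the images from the PDF files, save them in a directory, and then run OCR on those images with pytesseract. Is there a way to extract the scanned images from a PDF file in Python? Or is there a way to perform OCR directly on the PDF file?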

1 Answer:

Answer 0: (score: 2)

This question has already been addressed in earlier Stack Overflow posts:

Converting PDF to images automatically
Converting a PDF to a series of images with Python

Here is a script that may be useful: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

Another approach: https://www.daniweb.com/programming/software-development/threads/427722/convert-pdf-to-image-with-pythonmagick

Please check previous posts before asking.

Edit:

Including a working script for future reference. The program works with Python 3.6 on Windows:

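# coding=utf-8
# Extract JPGs from PDFs. Quick and dirty.
# Only finds images that are stored as raw JPEG streams inside the PDF.

with open("Link/To/PDF/File.pdf", "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"  # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image marker
endfix = 2  # include the two end-marker bytes in the slice
i = 0

njpg = 0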
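while True:
    # Find the next stream object, searching from the current offset.
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # A JPEG stream has its start marker within a few bytes of "stream".
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The JPEG end marker should sit just before "endstream".
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")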

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend
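
The script above only saves the images. The OCR step the question asks about could be sketched as follows. This is a minimal sketch, not part of the original answer: it assumes the `jpgN.jpg` files written above are in one directory, and that Pillow, pytesseract, and the Tesseract binary are installed. The names `default_ocr`, `ocr_directory`, and `page_number` are made up for illustration.

```python
from pathlib import Path


def default_ocr(image_path):
    """OCR a single image file with pytesseract and return the text."""
    # Imported lazily so the module loads even when these are not installed.
    # Assumption: Pillow, pytesseract, and the Tesseract binary are available.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def page_number(path):
    """Numeric part of a name like 'jpg12.jpg', so jpg2 sorts before jpg10."""
    digits = "".join(ch for ch in path.stem if ch.isdigit())
    return int(digits) if digits else 0


def ocr_directory(directory, ocr=default_ocr, pattern="jpg*.jpg"):
    """Return a list of (file name, recognized text) pairs in page order.

    `ocr` is any callable that takes a path and returns a string; it is a
    parameter so the OCR engine can be swapped out or stubbed.
    """
    images = sorted(Path(directory).glob(pattern), key=page_number)
    return [(image.name, ocr(image)) for image in images]


# Example usage (hypothetical directory; requires Tesseract to be installed):
# for name, text in ocr_directory("."):
#     print("=== %s ===" % name)
#     print(text)
```

Sorting by the numeric part of the file name keeps the text in the order the images were extracted, which a plain alphabetical sort would not do once there are ten or more images.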