Converting pages of a remote PDF to temporary images for OCR

Time: 2015-06-28 00:39:12

Tags: python pdf wand python-tesseract

I have a remote PDF file that I need to read page by page, passing each page to OCR to get its text.

import cStringIO
import tempfile
import urllib2

import pytesseract
from pyPdf import PdfFileWriter, PdfFileReader
from wand.image import Image
from PIL import Image as PILImage  # aliased so it does not shadow wand's Image

remoteFile = urllib2.urlopen(urllib2.Request("file:///home/user/Documents/TestDocs/test.pdf")).read()
memoryFile = cStringIO.StringIO(remoteFile)

pdfFile = PdfFileReader(memoryFile)
for pageNum in xrange(pdfFile.getNumPages()):
    currentPage = pdfFile.getPage(pageNum)

    ## somehow convert currentPage to wand type
    ## image and then pass to tesseract-api
    ##
    ## TEMP_IMAGE = some conversion to temp file
    ## pytesseract.image_to_string(PILImage.open(TEMP_IMAGE))

memoryFile.close()

I thought about using cStringIO or tempfile, but I could not figure out how to use them for this purpose.

How can I solve this?

1 answer:

Answer 0 (score: 1)

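There are two options for doing this. Given the code you provided, the more compatible one is to store each image temporarily on disk and delete it once pytesseract has read the text. I create a wand image to extract each page of the PDF individually, then convert it to a PIL image for pytesseract. Here is the code I used. The detected text is written to the list 'text', where each element corresponds to one page image from the original PDF. I also updated some imports to make the code compatible with Python 3 (cStringIO -> io and urllib2 -> urllib.request).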

import io
import os
import urllib.request

import PyPDF2
import pytesseract
from wand.image import Image
from PIL import Image as PILImage

pdf_url = 'file:///home/user/Documents/TestDocs/test.pdf'
# Placeholder path for the per-page image; any writable location works.
tempFile_Location = 'temp_page.png'

with urllib.request.urlopen(pdf_url) as response:
    pdf_read = response.read()

# PyPDF2 is only used to count the pages.
# On PyPDF2 >= 3.0 use PdfReader(...) and len(pdf_im.pages) instead.
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
    # The '[p]' suffix asks ImageMagick to load only page p (zero-based).
    with Image(filename=pdf_url + '[' + str(p) + ']') as img:
        with Image(image=img) as converted:  # Need second with to convert SingleImage object from wand to Image
            converted.save(filename=tempFile_Location)
            text.append(pytesseract.image_to_string(PILImage.open(tempFile_Location)))
            os.remove(tempFile_Location)
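
The temporary file location can also come from the tempfile module that the question mentions, which avoids name clashes and cleans up even if OCR raises. The following is a minimal sketch using only the standard library; the helper name temp_image_path is hypothetical and is not part of the original answer.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temp_image_path(suffix='.png'):
    """Yield a unique temporary file path and delete the file afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; the caller writes to it by name
    try:
        yield path
    finally:
        # Remove the file even if the body of the with-block raised.
        if os.path.exists(path):
            os.remove(path)


# Possible use inside the page loop (wand and pytesseract assumed installed):
#
#     with temp_image_path() as tempFile_Location:
#         converted.save(filename=tempFile_Location)
#         text.append(pytesseract.image_to_string(PILImage.open(tempFile_Location)))
```

With this helper the explicit os.remove call in the loop is no longer needed, since the context manager deletes the file on exit.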

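Alternatively, if you want to avoid creating and deleting a temporary file for each image, you can use numpy and OpenCV: extract the image as a blob, convert it to a numpy array, and then turn that into a PIL image for pytesseract to run OCR on (reference).
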
import io
import urllib.request

import cv2
import numpy as np
import PyPDF2
import pytesseract
from wand.image import Image
from PIL import Image as PILImage

pdf_url = 'file:///home/user/Documents/TestDocs/test.pdf'

with urllib.request.urlopen(pdf_url) as response:
    pdf_read = response.read()

# PyPDF2 is only used to count the pages.
# On PyPDF2 >= 3.0 use PdfReader(...) and len(pdf_im.pages) instead.
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
    # The '[p]' suffix asks ImageMagick to load only page p (zero-based).
    with Image(filename=pdf_url + '[' + str(p) + ']') as img:
        # Encode the page as PNG; without a format the blob stays PDF,
        # which cv2.imdecode cannot decode.
        img_buffer = np.asarray(bytearray(img.make_blob('png')), dtype=np.uint8)
        retval = cv2.imdecode(img_buffer, cv2.IMREAD_GRAYSCALE)
        text.append(pytesseract.image_to_string(PILImage.fromarray(retval)))
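
For a PDF that really is remote, the bytes already downloaded can be handed to wand directly, so no filename is needed at all. The sketch below is one possible variant, not the answer's method: it assumes wand (with ImageMagick and Ghostscript), Pillow and pytesseract are installed, and the function names fetch_pdf_bytes and ocr_pdf_bytes as well as the 300 dpi default are illustrative choices. It also decodes the PNG bytes with Pillow instead of OpenCV.

```python
import io
import urllib.request


def fetch_pdf_bytes(url):
    """Download the whole PDF into memory and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def ocr_pdf_bytes(pdf_bytes, resolution=300):
    """Return one OCR string per page of the PDF held in pdf_bytes."""
    # Third-party imports are local so the download helper works on its own.
    import pytesseract
    from PIL import Image as PILImage
    from wand.image import Image as WandImage

    texts = []
    # Rasterize all pages from the in-memory blob at the given resolution.
    with WandImage(blob=pdf_bytes, format='pdf', resolution=resolution) as pdf:
        for frame in pdf.sequence:
            # Copy the frame into a standalone image and encode it as PNG.
            with WandImage(image=frame) as page:
                png_blob = page.make_blob('png')
            # Pillow decodes the PNG bytes straight from memory.
            with PILImage.open(io.BytesIO(png_blob)) as pil_page:
                texts.append(pytesseract.image_to_string(pil_page))
    return texts
```

A call such as ocr_pdf_bytes(fetch_pdf_bytes('file:///home/user/Documents/TestDocs/test.pdf')) would then return a list with one string per page. Rasterizing every page in one go uses more memory than loading pages one at a time, so for very large PDFs the per-page approach may be preferable.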