Question

I have a remote PDF file that I need to read page by page, passing each page to OCR to get its text.

import pytesseract
from pyPdf import PdfFileWriter, PdfFileReader
import cStringIO
from wand.image import Image
import urllib2
import tempfile
from PIL import Image  # note: this rebinds Image, hiding wand.image.Image imported above

remoteFile = urllib2.urlopen(urllib2.Request("file:///home/user/Documents/TestDocs/test.pdf")).read()
memoryFile = cStringIO.StringIO(remoteFile)

pdfFile = PdfFileReader(memoryFile)
for pageNum in xrange(pdfFile.getNumPages()):
    currentPage = pdfFile.getPage(pageNum)

    ## somehow convert currentPage to wand type
    ## image and then pass to tesseract-api
    ##
    ## TEMP_IMAGE = some conversion to temp file
    ## pytesseract.image_to_string(Image.open(TEMP_IMAGE))

memoryFile.close()

I thought about using cStringIO or tempfile, but I can't figure out how to use them for this purpose.

How can I solve this?

Answer 1

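There are two ways to do this. Based on the code you provided, the more compatible option is to store each image temporarily on disk and delete it once pytesseract has read the text. I create a wand image to extract each page from the PDF individually, then convert it to a PIL image for pytesseract. Here is the code I used; the detected text is written to the list 'text', where each element corresponds to one image from the original PDF. I also updated some imports to make the code compatible with Python 3 (cStringIO -> io and urllib2 -> urllib.request).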

import PyPDF2
import os
import tempfile
import pytesseract
from wand.image import Image
from PIL import Image as PILImage
import urllib.request
import io

# Scratch image that is rewritten for every page (this location is an assumption)
tempFile_Location = os.path.join(tempfile.gettempdir(), 'ocr_page.png')

with urllib.request.urlopen('file:///home/user/Documents/TestDocs/test.pdf') as response:
    pdf_read = response.read()
    # PyPDF2 is only used to count pages (PdfFileReader/getNumPages are the pre-3.0 names)
    pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
    text = []
    for p in range(pdf_im.getNumPages()):
        # The '[p]' suffix asks ImageMagick to load only page p of the PDF
        with Image(filename='file:///home/user/Documents/TestDocs/test.pdf' + '[' + str(p) + ']') as img:
            with Image(image = img) as converted: #Need second with to convert SingleImage object from wand to Image
                converted.save(filename=tempFile_Location)
                text.append(pytesseract.image_to_string(PILImage.open(tempFile_Location)))
                os.remove(tempFile_Location)
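
Since the question asks how tempfile could be used here, the write/read/delete cycle above can also be sketched with the standard library alone. This is only an illustration: `with_temp_image` and its `process` callback are hypothetical names, not part of wand, PIL, or pytesseract, and the image bytes are assumed to be already encoded (for example as PNG).

```python
import os
import tempfile


def with_temp_image(image_bytes, process, suffix=".png"):
    """Write image_bytes to a uniquely named temp file, call process(path),
    and delete the file afterwards, even if process raises."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # mkstemp returns an open OS-level descriptor; wrap it to write bytes
        with os.fdopen(fd, "wb") as handle:
            handle.write(image_bytes)
        return process(path)
    finally:
        os.remove(path)
```

With the libraries from the answer, `process` could be something like `lambda path: pytesseract.image_to_string(PILImage.open(path))`. Letting mkstemp pick the name avoids collisions when several pages or processes are handled at once.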

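Alternatively, if you want to avoid creating and deleting a temporary file for each image, you can use numpy and OpenCV: extract the image as a blob, convert it to a numpy array, and then turn that into a PIL image for pytesseract to run OCR on (reference).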

import PyPDF2
import pytesseract
from wand.image import Image
from PIL import Image as PILImage
import urllib.request
import io
import numpy as np
import cv2

with urllib.request.urlopen('file:///home/user/Documents/TestDocs/test.pdf') as response:
    pdf_read = response.read()
    # PyPDF2 is only used to count pages (PdfFileReader/getNumPages are the pre-3.0 names)
    pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
    text = []
    for p in range(pdf_im.getNumPages()):
        # The '[p]' suffix asks ImageMagick to load only page p of the PDF
        with Image(filename=('file:///home/user/Documents/TestDocs/test.pdf') + '[' + str(p) + ']') as img:
            # Encode the page as PNG; a blob in the source PDF format cannot be decoded by OpenCV
            img_buffer=np.asarray(bytearray(img.make_blob('png')), dtype=np.uint8)
            # Decode the PNG bytes into a grayscale numpy array
            retval = cv2.imdecode(img_buffer, cv2.IMREAD_GRAYSCALE)
            text.append(pytesseract.image_to_string(PILImage.fromarray(retval)))
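
If OpenCV is not otherwise needed, the same in-memory idea can probably be expressed with io.BytesIO, since PIL can open a file-like object directly. The sketch below is an assumption-laden variant rather than the answer's method: `ocr_blobs` is a hypothetical helper, the blobs are assumed to be PNG-encoded (for example from `img.make_blob('png')`), and the two callables are injectable so the flow can be tried without Tesseract installed.

```python
import io


def ocr_blobs(png_blobs, open_image=None, image_to_text=None):
    """Run OCR over encoded image blobs without touching the disk.

    open_image and image_to_text default to PIL.Image.open and
    pytesseract.image_to_string; both are imported lazily.
    """
    if open_image is None:
        from PIL import Image as PILImage
        open_image = PILImage.open
    if image_to_text is None:
        import pytesseract
        image_to_text = pytesseract.image_to_string

    texts = []
    for blob in png_blobs:
        # Wrap the raw bytes in a file-like object that the opener can read
        with io.BytesIO(blob) as buffer:
            texts.append(image_to_text(open_image(buffer)))
    return texts
```

In the loop above, each page's `img.make_blob('png')` could be collected into a list and passed to `ocr_blobs` in one call.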

Converting pages of a remote PDF into temporary images for OCR
