我有一个远程PDF文件,我需要逐页阅读并继续将每个文件传递给OCR,这将给我OCR文本。
import pytesseract
from pyPdf import PdfFileWriter, PdfFileReader
import cStringIO
from wand.image import Image
import urllib2
import tempfile
import pytesseract
from PIL import Image
remoteFile = urllib2.urlopen(urllib2.Request("file:///home/user/Documents/TestDocs/test.pdf")).read()
memoryFile = cStringIO.StringIO(remoteFile)
pdfFile = PdfFileReader(memoryFile)
for pageNum in xrange(pdfFile.getNumPages()):
currentPage = pdfFile.getPage(pageNum)
## somehow convert currentPage to wand type
## image and then pass to tesseract-api
##
## TEMP_IMAGE = some conversion to temp file
## pytesseract.image_to_string(Image.open(TEMP_IMAGE))
memoryFile.close()
我想过使用cStringIO
或tempfile
,但我无法弄清楚如何将它们用于此目的。
如何解决这个问题?
答案 0 :(得分:1)
执行此操作有两个选项,根据您提供的代码,更兼容的方式是将图像临时存储在该目录中,然后在使用pytesseract读取文本后删除它们。我创建了一个魔杖类型的图像来单独从PDF中提取每个图像,然后将其转换为pytesseract的PIL类型图像。这是我用于此的代码,检测到的文本会写入数组' text'其中每个元素都是原始PDF中的图像,我还更新了一些导入,使其与Python3兼容(cStringIO-> io和urllib2-> urllib.request)。
import PyPDF2
import os
import pytesseract
from wand.image import Image
from PIL import Image as PILImage
import urllib.request
import io
with urllib.request.urlopen('file:///home/user/Documents/TestDocs/test.pdf') as response:
pdf_read = response.read()
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
with Image(filename='file:///home/user/Documents/TestDocs/test.pdf' + '[' + str(p) + ']') as img:
with Image(image = img) as converted: #Need second with to convert SingleImage object from wand to Image
converted.save(filename=tempFile_Location)
text.append(pytesseract.image_to_string(PILImage.open(tempFile_Location)))
os.remove(tempFile_Location)
或者,如果您想避免为每个图像创建和删除临时文件,可以使用numpy和OpenCV将图像提取为blob,将其转换为numpy数组,然后将其转换为PIL图像以便pytesseract为在(reference)
上执行OCRimport PyPDF2
import os
import pytesseract
from wand.image import Image
from PIL import Image as PILImage
import urllib.request
import io
import numpy as np
import cv2
with urllib.request.urlopen('file:///home/user/Documents/TestDocs/test.pdf') as response:
pdf_read = response.read()
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
with Image(filename=('file:///home/user/Documents/TestDocs/test.pdf') + '[' + str(p) + ']') as img:
img_buffer=np.asarray(bytearray(img.make_blob()), dtype=np.uint8)
retval = cv2.imdecode(img_buffer, cv2.IMREAD_GRAYSCALE)
text.append(pytesseract.image_to_string(PILImage.fromarray(retval)))