How can I use PyPDF2 to extract text from a PDF uploaded to Google App Engine?

Time: 2014-01-12 20:19:14

Tags: google-app-engine python-2.7 pypdf

Is there a way to extract the text and documentInfo from a PDF file uploaded through Google App Engine? I want to use PyPDF2, and my code looks like this:

pdf_file = self.request.POST['file'].file
pdf_reader = pypdf.PdfFileReader(pdf_file)

This gives me the error:

Traceback (most recent call last):
....
  File "/myrepo/myproj/main.py", line 154, in post
    pdf_text = pypdf.PdfFileReader(pdf_file)
  File "lib/PyPDF2/pdf.py", line 649, in __init__
    self.read(stream)
  File "lib/PyPDF2/pdf.py", line 1100, in read
    raise utils.PdfReadError, "EOF marker not found"
PdfReadError: EOF marker not found

It raises this error for every file, even ones that can be read successfully from disk via open(filename, 'r'). Am I missing something? Thanks in advance!
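PyPDF2 reports "EOF marker not found" when it cannot locate the %%EOF trailer near the end of the stream, which usually points to a stream that is empty, truncated, or not PDF data at all. A minimal sketch of a sanity check on the raw uploaded bytes follows. The helper name looks_like_pdf and the 1024-byte window are assumptions for illustration, not part of PyPDF2 or App Engine:

```python
def looks_like_pdf(data, window=1024):
    """Rough check that raw bytes resemble a complete PDF (hypothetical helper).

    A PDF normally has a '%PDF-' header near the start and a '%%EOF'
    trailer near the end; PyPDF2 fails with "EOF marker not found" when
    it cannot find the trailer.
    """
    has_header = b'%PDF-' in data[:window]
    has_trailer = b'%%EOF' in data[-window:]
    return has_header and has_trailer
```

Calling it on the bytes read from self.request.POST['file'].file (and logging len(data) alongside) would show whether the handler actually received the PDF body. Remember to seek(0) on the file object afterwards if it is read again.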

1 Answer:

Answer 0: (score: 1)

The solution is to use get_uploads from blobstore_handlers.BlobstoreUploadHandler:

from cStringIO import StringIO

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers
import PyPDF2

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # 'file' is the name of the upload field in the form.
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        # Read the stored blob into an in-memory, seekable buffer for PyPDF2.
        blob_reader = blobstore.BlobReader(blob_info)
        blob_content = StringIO(blob_reader.read())
        pdf_info = PyPDF2.PdfFileReader(blob_content)
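
The question asks for the text and documentInfo, while the handler above stops at constructing the reader. As a rough sketch, the remaining step could look like the following. It assumes the legacy PyPDF2 1.x method names (getDocumentInfo, getNumPages, getPage, extractText), which match the PdfFileReader class used here; later PyPDF2 releases renamed these. The function name extract_text_and_info is hypothetical:

```python
def extract_text_and_info(pdf_reader):
    """Return (document_info, page_texts) from a PyPDF2 1.x PdfFileReader.

    Hypothetical helper: it only relies on the legacy reader methods
    getDocumentInfo(), getNumPages() and getPage(n).extractText().
    """
    # May be None when the PDF carries no document information dictionary.
    document_info = pdf_reader.getDocumentInfo()

    page_texts = []
    for page_number in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page_number)
        # Scanned or image-only pages typically yield an empty string.
        page_texts.append(page.extractText())

    return document_info, page_texts
```

Inside post(), this would be called as extract_text_and_info(pdf_info) once the reader has been created. Text extraction quality depends on how the PDF was produced, so the returned strings may need cleanup.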