有没有办法从通过Google应用引擎上传的PDF文件中提取文本和documentInfo?我想使用PyPDF2,我的代码是这样的:
pdf_file = self.request.POST['file'].file
pdf_reader = pypdf.PdfFileReader(pdf_file)
这给了我错误:
Traceback (most recent call last):
....
File "/myrepo/myproj/main.py", line 154, in post
pdf_text = pypdf.PdfFileReader(pdf_file)
File "lib/PyPDF2/pdf.py", line 649, in __init__
self.read(stream)
File "lib/PyPDF2/pdf.py", line 1100, in read
raise utils.PdfReadError, "EOF marker not found"
PdfReadError: EOF marker not found
它为任何文件提供此错误,即使是那些可以通过open(filename, 'r')
答案 0 :(得分:1)
解决方案是使用get_uploads
中的blobstore_handlers.BlobstoreUploadHandler
:
from google.appengine.ext.webapp import blobstore_handlers
from cStringIO import StringIO
import PyPDF2
class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
def post(self):
upload_files = self.get_uploads('file')
blob_info = upload_files[0]
blob_reader = blobstore.BlobReader(blob_info)
blob_content = StringIO(blob_reader.read())
pdf_info = PyPDF2.PdfFileReader(blob_content)