Question

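I want to use PDFMiner to extract the contents of a PDF file that is available online.

My code is based on the example in the documentation for extracting the contents of a PDF file on the hard drive: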

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
document = PDFDocument(parser)

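With a few small changes, this works well.

Now I have tried using urllib2.urlopen on an online PDF, but that does not work. I get the error message: coercing to Unicode: need string or buffer, instance found.

How can I get a string (or something else) from urllib2.urlopen that behaves the same as what open gives me when I pass it the PDF's file name (as opposed to a URL)?

Please let me know if my question is unclear.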

Answer 1

OK, I finally found a solution.

I use Request and StringIO and drop the open('my_file', 'rb') call:

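import urllib2
from urllib2 import Request
from StringIO import StringIO
from pdfminer.pdfparser import PDFParser

url = 'my_url'

# Download the whole PDF into memory as a byte string.
pdf_data = urllib2.urlopen(Request(url)).read()
# Wrap the bytes in a seekable, file-like object for the parser.
memory_file = StringIO(pdf_data)

parser = PDFParser(memory_file)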

This way Python treats the content behind the URL as if it were a file (so to speak).
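For anyone on Python 3, where urllib2 and StringIO no longer exist, the same idea can probably be sketched with urllib.request and io.BytesIO. This is only a minimal sketch: the helper name open_remote is made up for illustration, and it assumes the PDF is small enough to hold in memory.

```python
import io
import urllib.request


def open_remote(url):
    """Download the resource at url and return it as an in-memory file.

    open_remote is a hypothetical helper name, not part of PDFMiner.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()  # the whole document as bytes
    # BytesIO supports read() and seek(), like a file opened with 'rb'.
    return io.BytesIO(data)
```

The returned object can then be handed to PDFParser in place of the result of open('mypdf.pdf', 'rb'), for example parser = PDFParser(open_remote(url)), assuming a PDFMiner version that runs on Python 3 (such as pdfminer.six) is installed.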
