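I'm working on a scrapy spider and trying to convert PDFs to text with pdfminer (https://pypi.python.org/pypi/pdfminer2). I'm not interested in saving the actual PDF to disk, so I was advised to look at the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams. Based on "Creating bytesIO object", I have initialized a BytesIO instance with the PDF body, but now I need to open the data and follow the basic usage example at http://www.unixuser.org/~euske/python/pdfminer/programming.html. So far, based on http://zevross.com/blog/2014/04/09/extracting-tabular-data-from-a-pdf-an-example-using-python-and-regular-expressions/, I have: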
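# Imports are not shown in the original post; these are assumed from the
# error messages and the linked examples.
from io import BytesIO, StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

in_memory_pdf = BytesIO(bytes(response.body))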
in_memory_pdf.seek(0)
rsrcmgr = PDFResourceManager()
retstr = StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(in_memory_pdf, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                              caching=caching, check_extractable=True):
    interpreter.process_page(page)
fp.close()
device.close()
st = retstr.getvalue()
retstr.close()
print st
When I run this, I get:
fp = file(in_memory_pdf, 'rb')
TypeError: coercing to Unicode: need string or buffer, _io.BytesIO found
How can I open this PDF byte string for processing?
After making the suggested change, I get:
2016-10-17 23:59:35 [root] DEBUG: exec: ET
2016-10-17 23:59:35 [root] DEBUG: nexttoken: (2819L, /'Q')
2016-10-17 23:59:35 [root] DEBUG: do_keyword: pos=2819L, token=/'Q', stack=[]
2016-10-17 23:59:35 [root] DEBUG: add_results: ((2819L, /'Q'),)
2016-10-17 23:59:35 [root] DEBUG: nextobject: (2819L, /'Q')
2016-10-17 23:59:35 [root] DEBUG: exec: Q
Traceback (most recent call last):
  File "C:\\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\j1\spiders\j1_spider.py", line 235, in parse_pdf_to_html
    interpreter.process_page(page)
  File "C:\\site-packages\pdfminer\pdfinterp.py", line 835, in process_page
    self.device.end_page(page)
  File "C:\\site-packages\pdfminer\converter.py", line 53, in end_page
    self.receive_layout(self.cur_item)
  File "C:\\site-packages\pdfminer\converter.py", line 206, in receive_layout
    render(ltpage)
  File "C:\\site-packages\pdfminer\converter.py", line 196, in render
    render(child)
  File "C:\\site-packages\pdfminer\converter.py", line 196, in render
    render(child)
  File "C:\\site-packages\pdfminer\converter.py", line 196, in render
    render(child)
  File "C:\\site-packages\pdfminer\converter.py", line 198, in render
    self.write_text(item.get_text())
  File "C:\\site-packages\pdfminer\converter.py", line 189, in write_text
    self.outfp.write(text)
TypeError: unicode argument expected, got 'str'
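Answer 0 (score: 1)
There are two problems:
1. in_memory_pdf is already a file-like object over str (bytes in Py3) and can be used directly without opening it. So changing fp = file(in_memory_pdf, 'rb') to fp = in_memory_pdf gets you part of the way.
2. The second argument of TextConverter should also be a file-like object over str (bytes in Py3). The retstr in the question, however, is a StringIO, which only accepts unicode (str in Py3), while the traceback shows write() being handed a str. So retstr = StringIO() should be changed to retstr = BytesIO().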
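Both points can be checked without pdfminer. The following is a minimal Python 3 sketch using only the standard library. It is an illustration under assumptions, not pdfminer's code: fake_body is a hypothetical stand-in for response.body, and write_encoded is a made-up helper that imitates a converter encoding text with a codec before writing it.

```python
from io import BytesIO, StringIO

# Hypothetical stand-in for response.body (not a real PDF).
fake_body = b"%PDF-1.4 example bytes"

# 1. BytesIO is already a file-like object: read() and seek() work
#    directly, so there is nothing to pass to file() or open().
in_memory_pdf = BytesIO(fake_body)
header = in_memory_pdf.read(8)
in_memory_pdf.seek(0)


def write_encoded(outfp, text, codec="utf-8"):
    """Illustrative helper: encode text, then write the bytes to outfp."""
    outfp.write(text.encode(codec))


# 2. A text buffer rejects encoded bytes with a TypeError ...
text_buffer = StringIO()
try:
    write_encoded(text_buffer, "hello")
    text_buffer_error = None
except TypeError as exc:
    text_buffer_error = exc

# ... while a bytes buffer accepts them.
byte_buffer = BytesIO()
write_encoded(byte_buffer, "hello")
result = byte_buffer.getvalue().decode("utf-8")
```

In the spider, the same idea would mean passing the BytesIO built from response.body straight to PDFPage.get_pages and giving TextConverter a BytesIO as its output buffer, then decoding getvalue() with the same codec.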