Opening a PDF byte string for reading

Time: 2016-10-18 03:22:54

Tags: python pdf

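I am working on a scrapy spider and trying to convert PDFs with pdfminer (https://pypi.python.org/pypi/pdfminer2). I am not interested in saving the actual PDF to disk, so I was advised to look at the io.BytesIO class at https://docs.python.org/2/library/io.html#buffered-streams. Based on "Creating bytesIO object", I have initialized a BytesIO instance with the PDF body, but now I need to open the data and follow the basic usage example at http://www.unixuser.org/~euske/python/pdfminer/programming.html. So far, based on http://zevross.com/blog/2014/04/09/extracting-tabular-data-from-a-pdf-an-example-using-python-and-regular-expressions/, I have: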

    in_memory_pdf = BytesIO(bytes(response.body))
    in_memory_pdf.seek(0)

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()

    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    fp = file(in_memory_pdf, 'rb')

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    st = retstr.getvalue()
    retstr.close()
    print st

When I run it, I get:

fp = file(in_memory_pdf, 'rb')
TypeError: coercing to Unicode: need string or buffer, _io.BytesIO found

How can I open this PDF byte string for processing?

After making the suggested change, I got:

2016-10-17 23:59:35 [root] DEBUG: exec: ET
2016-10-17 23:59:35 [root] DEBUG: nexttoken: (2819L, /'Q')
2016-10-17 23:59:35 [root] DEBUG: do_keyword: pos=2819L, token=/'Q', stack=[]
2016-10-17 23:59:35 [root] DEBUG: add_results: ((2819L, /'Q'),)
2016-10-17 23:59:35 [root] DEBUG: nextobject: (2819L, /'Q')
2016-10-17 23:59:35 [root] DEBUG: exec: Q

Traceback (most recent call last):
  File "C:\\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\j1\spiders\j1_spider.py", line 235, in parse_pdf_to_html
    interpreter.process_page(page)
  File "C:\\site-packages\pdfminer\pdfinterp.py", line 835, in process_page
    self.device.end_page(page)
  File "C:\\site-packages\pdfminer\converter.py", line 53, in end_page
    self.receive_layout(self.cur_item)
  File "C:\\site-packages\pdfminer\converter.py", line 206, in receive_layout
    render(ltpage)
  File "C:\\site-packages\pdfminer\converter.py", line 196, in render
    render(child)
  File "C:\\site-packages\pdfminer\converter.py", line 196, in render
    render(child)
  File "C:\\site-packages\pdfminer\converter.py", line 196, in render
    render(child)
  File "C:\\site-packages\pdfminer\converter.py", line 198, in render
    self.write_text(item.get_text())
  File "C:\\site-packages\pdfminer\converter.py", line 189, in write_text
    self.outfp.write(text)
TypeError: unicode argument expected, got 'str'

1 answer:

Answer 0 (score: 1)

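There are two problems:

  1. in_memory_pdf is already a file-like object over str (bytes in Py3), so it can be used directly without opening it. Changing fp = file(in_memory_pdf, 'rb') to fp = in_memory_pdf therefore fixes the first error.
  2. The second argument of TextConverter should also be a file-like object over str (bytes in Py3), but retstr in the question is a buffer for unicode (str in Py3). So retstr = StringIO() should be changed to retstr = BytesIO().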
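
The two points can be sketched as follows. This is only an illustration under assumptions not made in the original post: it targets Python 3 and the pdfminer.six fork rather than Python 2 and pdfminer2, the helper name pdf_bytes_to_text is hypothetical, and the placeholder bytes are not a real PDF. The first part uses only the standard library to show both type mismatches; the function then applies both fixes.

```python
from io import BytesIO, StringIO

# Point 1: a BytesIO object is already file-like, so it can be read directly.
# Passing it to open() (file() in Python 2) raises TypeError instead.
in_memory_pdf = BytesIO(b"%PDF-1.4 placeholder bytes, not a real PDF")
header = in_memory_pdf.read(8)
in_memory_pdf.seek(0)
try:
    open(in_memory_pdf, "rb")
    open_error = None
except TypeError as exc:
    open_error = exc

# Point 2: a text buffer rejects encoded bytes, while a byte buffer accepts them.
try:
    StringIO().write(b"encoded text")
    write_error = None
except TypeError as exc:
    write_error = exc
byte_sink = BytesIO()
byte_sink.write(b"encoded text")


def pdf_bytes_to_text(pdf_bytes, codec="utf-8"):
    """Hypothetical helper: extract text from raw PDF bytes held in memory."""
    # Assumption: pdfminer.six is installed. It is imported here so the
    # dependency is only needed when the function is actually called.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    fp = BytesIO(pdf_bytes)   # fix 1: use the buffer as the input file object
    outfp = BytesIO()         # fix 2: byte-oriented output buffer
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    try:
        for page in PDFPage.get_pages(fp, set(), maxpages=0, password="",
                                      caching=True, check_extractable=True):
            interpreter.process_page(page)
    finally:
        device.close()
        fp.close()
    # Decode the collected bytes back to text for the caller.
    return outfp.getvalue().decode(codec)
```

In a scrapy callback this would presumably be called as text = pdf_bytes_to_text(response.body), since response.body already holds the raw bytes of the download.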