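I want to use PDFMiner to extract text from a PDF into a .txt file. I found some code, but I don't know how to use it

Time: 2016-05-21 21:32:31

Tags: python python-2.7 pdfminer

This is the code I found here. I don't know how to use it. Can someone walk me through it and help me convert a sample PDF?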

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

2 Answers:

Answer 0 (score: 2)

If you are using pdfminer, you can use the code from this page, which also serves as documentation: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

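I don't think you should have any trouble using it:

def convert(fname, pages=None): does essentially all of the PDF conversion for you and returns the extracted text.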
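A note on the pages argument: convert passes it to PDFPage.get_pages, which compares it against a zero-based page index, so the first page of the PDF is 0. Here is a minimal sketch of a helper that turns a 1-based page range into the zero-based set that convert expects. The name to_zero_based is made up for illustration and is not part of PDFMiner.

```python
def to_zero_based(first, last):
    """Return zero-based page indices for the 1-based inclusive range first..last."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return set(range(first - 1, last))

# Pages 1 through 3 of the PDF become indices 0, 1 and 2.
# Assumes the convert() function defined above:
# text = convert("filename.pdf", pages=to_zero_based(1, 3))
```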

Use it like this:

some_variable = convert("filename.pdf") 
print(some_variable)
#do something with your variable
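Since the question asks for a text file rather than a printed string, the returned text can be written to disk. Below is a minimal sketch, assuming convert returns either a UTF-8 byte string (the usual case on Python 2.7 with cStringIO) or unicode text. The helper name save_text and the file names are hypothetical.

```python
import io

def save_text(text, out_path):
    """Write extracted text to out_path as UTF-8."""
    # A byte string is assumed to be UTF-8 and is decoded first;
    # unicode text passes through unchanged.
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    with io.open(out_path, "w", encoding="utf-8") as out:
        out.write(text)

# Assumes the convert() function defined above:
# save_text(convert("filename.pdf"), "filename.txt")
```

io.open is used instead of the built-in open so that the same code runs on Python 2.7 and Python 3.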

Using your sample PDF: [image]

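Answer 1 (score: -1)

I finally found a solution. The best library is PDFMiner; with a few small modifications to pdf2txt.py it can be used effectively. pdf2txt.py is located in pdfminer/tools.

Install PDFMiner from the terminal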
