This is code I found here. I don't know how to use it. Can someone explain it to me and help me convert a sample PDF?
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO  # Python 2 only

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
Answer 0 (score: 2)
If you are using pdfminer, use the code from this page and read its documentation (https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167):
from cStringIO import StringIO  # Python 2 only
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    # An empty set tells get_pages to process every page.
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = open(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text
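You shouldn't have any trouble using it:
convert(fname, pages=None) does the conversion for you: it takes the path to a PDF and returns the extracted text as a string.
Use it like this:
some_variable = convert("filename.pdf")
print(some_variable)
# do something with your variable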
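The code above targets Python 2 (cStringIO). A minimal sketch of the same function for Python 3 follows, assuming the pdfminer.six fork is installed. The zero_based helper is hypothetical and not part of pdfminer: it is added only because PDFPage.get_pages counts pages from 0, which is easy to get wrong when passing page numbers by hand.

```python
from io import StringIO


def zero_based(pages=None):
    """Hypothetical helper: map 1-based page numbers to the 0-based set
    that PDFPage.get_pages expects. An empty set means all pages."""
    if not pages:
        return set()
    return {p - 1 for p in pages}


def convert(fname, pages=None):
    """Return the text of the PDF at fname; pages is an optional iterable
    of 1-based page numbers."""
    # Imported here so the module still loads where pdfminer.six is missing.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    # The with block closes the file even if a page fails to parse.
    with open(fname, "rb") as infile:
        for page in PDFPage.get_pages(infile, zero_based(pages)):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()
    return text
```

With this version, convert("filename.pdf", pages=[1, 3]) would extract only the first and third pages, and convert("filename.pdf") extracts everything.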
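Answer 1 (score: -1)
I finally found a solution. The best library is PDFMiner, with a small modification to pdf2txt.py, which cannot be used effectively as shipped. pdf2txt.py is located in pdfminer/tools.
Install PDFMiner from the terminal.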
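The answer does not show how the script is called, so here is a rough sketch of running pdf2txt.py from Python without modifying it. The function names and the script path passed in are assumptions for illustration; the -t (output type) and -o (output file) options are the ones pdf2txt.py documents.

```python
import subprocess
import sys


def pdf2txt_command(script, pdf_path, out_path):
    """Build the argument list that runs pdf2txt.py with the current interpreter."""
    return [sys.executable, script, "-t", "text", "-o", out_path, pdf_path]


def run_pdf2txt(script, pdf_path, out_path):
    """Run pdf2txt.py and write the extracted text to out_path."""
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(pdf2txt_command(script, pdf_path, out_path), check=True)


# Example call (paths are placeholders, adjust to your checkout):
# run_pdf2txt("pdfminer/tools/pdf2txt.py", "sample.pdf", "sample.txt")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.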