Python pdfminer LAParams mixes up text output

Time: 2017-12-09 15:52:17

Tags: python pdfminer

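I have a PDF file whose text I want to parse with pdfminer. The problem is that, with LAParams, the extraction sometimes goes wrong and part of a line ends up at the end of the output. I can't figure out why. My PDF looks like this: [pdf]. The output looks like this: [output]. My code is below, thanks in advance: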

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

print(convert_pdf_to_txt('C:\\Users\\Vagos\\Desktop\\europe.pdf'))

1 answer:

Answer 0 (score: 2)

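Found the answer myself. The char_margin of LAParams() defaults to 2.0. The gaps between characters in my file are apparently sometimes larger than that, so a single visual line gets split and part of it lands at the end of the output. Replacing LAParams() with LAParams(char_margin = 20) solved the problem. For the other parameters, see also http://nullege.com/codes/search/pdfminer.layout.LAParams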
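
To illustrate why a larger char_margin helps, here is a toy model in pure Python. It is only a sketch of the idea, not pdfminer's actual code: the function name group_into_lines and the (text, x0, x1) box format are made up for this example. It assumes that two neighbouring characters stay in the same line when the horizontal gap between them is smaller than char_margin times the width of the wider character, which is roughly how I understand pdfminer's layout analysis to work.

```python
# Simplified model of how char_margin affects line grouping.
# NOT pdfminer's real implementation; all names here are hypothetical.

def group_into_lines(chars, char_margin):
    """Group (text, x0, x1) boxes on one baseline into line fragments.

    Neighbouring boxes stay in the same fragment when the horizontal gap
    between them is smaller than char_margin * (width of the wider box).
    """
    fragments = []
    current = []
    for box in sorted(chars, key=lambda b: b[1]):
        if current:
            prev = current[-1]
            gap = box[1] - prev[2]
            widest = max(prev[2] - prev[1], box[2] - box[1])
            if gap >= widest * char_margin:
                # Gap too wide: close the current fragment, start a new one.
                fragments.append(current)
                current = []
        current.append(box)
    if current:
        fragments.append(current)
    return ["".join(text for text, _, _ in frag) for frag in fragments]


# Four characters on the same visual line, each 5 units wide,
# with a 50-unit gap between "B" and "C".
row = [("A", 0, 5), ("B", 5, 10), ("C", 60, 65), ("D", 65, 70)]

default_result = group_into_lines(row, char_margin=2.0)  # threshold: 10 units
wide_result = group_into_lines(row, char_margin=20)      # threshold: 100 units
print(default_result)
print(wide_result)
```

With the default margin the row is split into two fragments, which a layout analyzer may then emit far apart; with the larger margin it stays one line. In the question's code the actual change is just laparams = LAParams(char_margin=20).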