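I have a PDF file whose text I want to parse with pdfminer. The problem is that layout analysis with LAParams sometimes fails: part of a line is split off and printed at the end of the output instead of in its place. I can't figure out why.
My PDF looks like this:
The output looks like this:
Here is my code. Thanks in advance: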
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    caching = True
    pagenos = set()  # an empty set means "all pages"
    # The with-block closes the file even if parsing raises.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text


print(convert_pdf_to_txt('C:\\Users\\Vagos\\Desktop\\europe.pdf'))
Answer 0 (score: 2)
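Found the answer myself. The relevant LAParams() parameter is char_margin, which limits how far apart two characters may be and still be grouped into the same line. Its default is small (2.0 in pdfminer.six as far as I can tell, not 0.3; the 0.1 default belongs to word_margin). The gaps in my file are apparently sometimes larger than that, which caused the problem. Replacing LAParams() with LAParams(char_margin=20) solved it. For the other parameters, see also http://nullege.com/codes/search/pdfminer.layout.LAParams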
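To illustrate why a larger char_margin helps, here is a rough sketch of gap-based grouping. It is a simplified toy model, not pdfminer's actual implementation: the Char class, make_chars and group_into_lines are hypothetical names made up for this example, and it assumes all characters sit on one baseline and are already sorted left to right. Two neighbouring characters stay in the same group only if the horizontal gap between them is at most char_margin times the wider character's width.

```python
from dataclasses import dataclass


@dataclass
class Char:
    # Hypothetical stand-in for a laid-out character (not a pdfminer class).
    text: str
    x0: float  # left edge
    x1: float  # right edge


def make_chars(text, start_x, width=5.0):
    """Lay out the characters of text side by side, each width units wide."""
    return [Char(c, start_x + i * width, start_x + (i + 1) * width)
            for i, c in enumerate(text)]


def group_into_lines(chars, char_margin):
    """Split left-to-right sorted chars wherever the gap is too wide.

    A new group starts when the gap to the previous character exceeds
    char_margin * (width of the wider of the two characters).
    """
    groups = []
    current = []
    for ch in chars:
        if current:
            prev = current[-1]
            gap = ch.x0 - prev.x1
            limit = char_margin * max(prev.x1 - prev.x0, ch.x1 - ch.x0)
            if gap > limit:
                groups.append("".join(c.text for c in current))
                current = []
        current.append(ch)
    if current:
        groups.append("".join(c.text for c in current))
    return groups


# One row with a wide gap (60 units) between two columns of 5-unit characters.
row = make_chars("Country", 0.0) + make_chars("Greece", 95.0)

print(group_into_lines(row, char_margin=2.0))  # limit is 10, so the row is split
print(group_into_lines(row, char_margin=20))   # limit is 100, so the row stays together
```

With the small margin the row breaks into two separate pieces, which is the kind of split that can make half of a line show up somewhere else in the output. With char_margin=20 the same row is kept as one line. The toy model does not insert spaces between words; in pdfminer that is handled separately by word_margin.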