Question

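I am using Python 3.4 on Windows 7 and was hoping to use PDFMiner to extract text from PDF files. However, losing information is very common in my tests. For some files it is only a matter of a few sentences, but I have run into cases where half of the text could not be extracted, depending on the file format. Here is my full code: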

import io
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams


def convert_pdf(pdfFile, retstr):
    password = ''
    pagenos = set()
    maxpages = 0
    laparams = LAParams()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile, pagenos, maxpages=maxpages, password=password, check_extractable=True)
    device.close()
    return retstr


def extract_pdf(file_name, language):
    pdfFile = open(file_name, 'rb')
    retstr = io.StringIO()
    retstr = convert_pdf(pdfFile, retstr)
    whole = retstr.getvalue()
    original_texts = whole.split('\n')
    pdfFile.close()
    return original_texts

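I would like to know whether there is a way to extract the full text using PDFMiner. I have heard of poppler, but I cannot seem to find out how to use it as a Python library. Also, I would rather not use the command line. Can anyone help?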

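Here is an example: a thesis. Several paragraphs are lost when it is extracted with the code above. On the second page, for instance, I can only extract the first half of the page, up to "Pereira, Tishby and Lee (1993)" in the middle. Then, for no apparent reason, it jumps straight to the next page.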
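To pin down exactly where text goes missing, one option is to break the extracted string into pages and look at how much text each page produced and where it stops. The sketch below is only a diagnostic aid, not a fix. It assumes that TextConverter writes a form feed character ('\f') at the end of every page, which is worth verifying against the installed PDFMiner version. The function name summarize_pages is made up for illustration.

```python
def summarize_pages(whole):
    """Split extracted text into pages and report how much text each page holds.

    Assumption: the converter emits a form feed ('\f') after every page.
    Returns a list of (page_number, char_count, non_empty_line_count, last_line).
    """
    pages = whole.split('\f')
    # A form feed at the very end leaves an empty final chunk; drop it.
    if pages and pages[-1].strip() == '':
        pages.pop()

    summary = []
    for number, text in enumerate(pages, start=1):
        # Ignore blank lines so the count reflects real content.
        lines = [line for line in text.split('\n') if line.strip()]
        last_line = lines[-1] if lines else ''
        summary.append((number, len(text), len(lines), last_line))
    return summary
```

Calling summarize_pages(whole) inside extract_pdf, right after retstr.getvalue(), and printing the result should show whether the second page really ends at the "Pereira, Tishby and Lee (1993)" line, which would point to the extraction stopping mid-page rather than the text being reordered elsewhere in the output.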

Losing information when extracting text from PDF using PDFMiner

0 answers: