How do I extract text from merged PDF files?

Time: 2018-07-23 09:29:15

Tags: python pypdf2

I wrote some code to merge a bunch of PDF files in a directory and extract their text, but the code does not work.

import os
import PyPDF2

# Merge all PDF files in the current working directory.
merger = PyPDF2.PdfFileMerger()
for file in [a for a in os.listdir() if a.endswith(".pdf")]: merger.append(open(file, 'rb'))

# Now extract text from the merged object.
corpus = ""
pdf_reader = PyPDF2.PdfFileReader(merger)  # this line raises the AttributeError
for page in pdf_reader.pages:
    # For each page, append its lines of text to the corpus.
    for line in page.extractText().splitlines(): corpus += line

The following error occurs: AttributeError: 'PdfFileMerger' object has no attribute 'seek'

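I don't want to write the merged file to disk and then process it, because that would be slow and error-prone. How can I pass the merged object directly to PdfFileReader?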

1 answer:

Answer 0 (score: 0)

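You can "output" the merged file to an in-memory IO object instead of writing it to disk, and then process that pseudo-file. Also, your loop calls open on many files that it never closes afterwards, which is very bad practice and can cause problems.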

import os
from io import BytesIO  # in-memory binary stream; PDF data is bytes, not text

import PyPDF2

# Merge all PDF files in the current working directory.
merger = PyPDF2.PdfFileMerger()
for file in [a for a in os.listdir() if a.endswith(".pdf")]:
    with open(file, 'rb') as f:  # this will also close the file once it's done
        merger.append(f)

# Merge the files and "output" to an IO object instead of a file.
tmp = BytesIO()
merger.write(tmp)
tmp.seek(0)  # rewind so the buffer can be read from the start

# Now extract text from the merged object.
corpus = ""
pdf_reader = PyPDF2.PdfFileReader(tmp)
for page in pdf_reader.pages:
    # For each page, append its lines of text to the corpus.
    for line in page.extractText().splitlines(): corpus += line
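
Under Python 3 the in-memory object has to be a binary stream, because PDF data is bytes. The following is a minimal, standard-library-only sketch of that point and does not involve PyPDF2 at all. The payload bytes are a made-up stand-in rather than a valid PDF, and the variable names are illustrative. It shows that io.BytesIO offers the seek, read and write methods of a file opened in 'rb' mode, while the text-oriented io.StringIO refuses bytes.

```python
import io

# Hypothetical stand-in for PDF bytes; this is not a valid PDF document.
payload = b"%PDF-1.4 fake payload"

# BytesIO accepts bytes and behaves like a file opened in 'rb'/'wb' mode.
buf = io.BytesIO()
written = buf.write(payload)   # number of bytes written
buf.seek(0)                    # rewind before reading back
round_trip = buf.read()

# These are the file-like methods a reader typically relies on.
has_file_methods = all(hasattr(buf, name) for name in ("seek", "read", "write"))

# StringIO is a text stream, so writing bytes raises TypeError in Python 3.
try:
    io.StringIO().write(payload)
    text_stream_accepts_bytes = True
except TypeError:
    text_stream_accepts_bytes = False

print(round_trip == payload, has_file_methods, text_stream_accepts_bytes)
```

The original AttributeError fits the same picture: the merger object is not file-like (it has no seek), so it cannot stand in for a stream, whereas a buffer that the merger has written into can.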