I wrote some code to merge a bunch of PDF files in a directory and extract their text, but it doesn't work.
import os
import PyPDF2

# Merge all PDF files in the current working directory.
merger = PyPDF2.PdfFileMerger()
for file in [a for a in os.listdir() if a.endswith(".pdf")]:
    merger.append(open(file, 'rb'))

# Now extract text from the merged object.
pdf_reader = PyPDF2.PdfFileReader(merger)  # this line raises the AttributeError
corpus = ""
for page in pdf_reader.pages:
    # For each page, get corpus of text.
    for line in page.extractText().splitlines():
        corpus += line
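It fails with the following error: AttributeError: 'PdfFileMerger' object has no attribute 'seek'
I don't want to write the merged file out and then process it; that would be slow and error-prone. How can I pass the merged object directly to PdfFileReader?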
Answer 0 (score: 0)
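You can "output" the merged file to an in-memory IO object instead of writing it to disk, and then process that pseudo-file. Also, your loop calls open on many files that you never close afterwards, which is very bad practice and can cause problems.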
import os
from io import BytesIO  # in-memory binary stream; PDF data is bytes, not text

import PyPDF2

# Merge all PDF files in the current working directory.
merger = PyPDF2.PdfFileMerger()
for file in [a for a in os.listdir() if a.endswith(".pdf")]:
    with open(file, 'rb') as f:  # this will also close the file once it's done
        merger.append(f)

# Merge the files and "output" to an IO object instead of a file.
tmp = BytesIO()
merger.write(tmp)
tmp.seek(0)  # rewind to the start of the buffer before reading

# Now extract text from the merged object.
pdf_reader = PyPDF2.PdfFileReader(tmp)
corpus = ""
for page in pdf_reader.pages:
    # For each page, get corpus of text.
    for line in page.extractText().splitlines():
        corpus += line
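
To see why the original attempt fails and the in-memory buffer works, here is a minimal sketch that uses only the standard library. FakeMerger and is_readable_stream are hypothetical names invented for this illustration, not part of PyPDF2. The stand-in class can write bytes to a stream but has no read or seek method itself, which mirrors the situation the error message describes.

```python
from io import BytesIO


class FakeMerger:
    """Hypothetical stand-in for a merger: it can write to a stream,
    but it is not itself a readable stream (no read/seek methods)."""

    def __init__(self):
        self._chunks = []

    def append(self, data: bytes) -> None:
        self._chunks.append(data)

    def write(self, stream) -> None:
        # Dump everything collected so far into the given stream.
        for chunk in self._chunks:
            stream.write(chunk)


def is_readable_stream(obj) -> bool:
    # A reader that calls obj.seek(...) and obj.read(...) needs both methods.
    return callable(getattr(obj, "read", None)) and callable(getattr(obj, "seek", None))


merger = FakeMerger()
merger.append(b"first ")
merger.append(b"second")

buffer = BytesIO()
merger.write(buffer)  # "output" to memory rather than to disk
buffer.seek(0)        # rewind so the next read starts at the beginning

merger_is_stream = is_readable_stream(merger)
buffer_is_stream = is_readable_stream(buffer)
contents = buffer.read()
```

The merger object fails the check while the BytesIO buffer passes it, so the buffer is the thing to hand to a reader that expects a file.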