Question

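Question: How do I use the Python package "slate" to read many PDFs in the same path?

I have a folder containing more than 600 PDFs.

I know how to use the slate package to convert a single PDF to text, using the following code: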

migFiles = [filename for filename in os.listdir(path)
            if re.search(r'(.*\.pdf$)', filename) is not None]
# open in binary mode; a PDF is not a text file
with open(migFiles[0], 'rb') as f:
    doc = slate.PDF(f)

len(doc)

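However, this limits you to one PDF at a time, the one specified by "migFiles[0]", where 0 is the index of the first PDF in my path.

How can I read many PDF files into text at once, keeping them as separate strings or txt files? Should I use another package? How do I write a "for loop" that reads all the PDFs in the path?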

Answer 1

What you can do is use a simple loop:

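docs = []
for filename in migFiles:
    # open in binary mode; a PDF is not a text file
    with open(filename, 'rb') as f:
        docs.append(slate.PDF(f))  # the class is slate.PDF, as in the question
        # or, instead of keeping the document in memory, just process it here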

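Then docs[i] holds the text of the (i+1)-th PDF file, and you can do whatever you like with it later. Alternatively, you can process each file inside the for loop.

If you want to convert the pages to plain text, you can do the following: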
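To illustrate processing inside the loop, here is a minimal sketch of a generator that hands back one document at a time, so only the current PDF's text is held in memory. The name iter_pdf_texts and the extract_pages parameter are my own, not part of slate; the sketch assumes extract_pages is a callable such as slate.PDF that takes an open binary file and returns the page strings.

```python
def iter_pdf_texts(filenames, extract_pages, separator="\n"):
    """Yield (filename, text) for one PDF at a time.

    extract_pages is assumed to accept an open binary file and return
    an iterable of page strings (for example slate.PDF).
    """
    for filename in filenames:
        # binary mode, since a PDF is not a text file
        with open(filename, "rb") as f:
            pages = extract_pages(f)
            yield filename, separator.join(pages)
```

With slate installed, it could be used as: for name, text in iter_pdf_texts(migFiles, slate.PDF): print(name, len(text))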

docs = []
# The string used to separate the contents of consecutive pages; to put
# each page on its own line, use separator = '\n'
separator = ' '
for filename in migFiles:
    with open(filename, 'rb') as f:
        docs.append(separator.join(slate.PDF(f)))  # join the pages into plain text

Or:

separator = ' '
for filename in migFiles:
    with open(filename, 'rb') as f:
        # if filename = "abc.pdf", then filename[:-4] = "abc"
        with open(filename[:-4] + ".txt", 'w') as txtfile:
            txtfile.write(separator.join(slate.PDF(f)))

Answer 2

Try this version:

import glob
import os

import slate

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            # slate.PDF gives a list of page strings, so join them before writing
            txt.write("\n".join(slate.PDF(pdf)))

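This creates a text file with the same name as the PDF, in the same directory as the PDF, containing the converted content.

Or, if you want to keep the content in memory, try this version; but keep in mind that if the converted content is large, you may run out of available memory: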

import glob
import os

import slate

pdf_as_text = {}

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        file_without_extension = os.path.splitext(pdf_file)[0]
        # stores the list of page strings for this PDF
        pdf_as_text[file_without_extension] = slate.PDF(pdf)

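Now you can get the text content with pdf_as_text[key], where key is the PDF's path without its .pdf extension, for example os.path.join(path, 'somefile').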
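Because glob.glob returns paths that include the directory, the dictionary keys above carry that directory prefix. If you would rather look documents up by bare file name, a minimal sketch could strip the directory as well as the extension. The helper name text_key is hypothetical, not part of slate or the original answer.

```python
import os


def text_key(pdf_path):
    """Return the file name without its directory or extension."""
    return os.path.splitext(os.path.basename(pdf_path))[0]
```

Inside the loop you would then write pdf_as_text[text_key(pdf_file)] = slate.PDF(pdf). This is safe here because all the files come from a single directory, so bare names cannot collide.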
