Question

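I have a large directory of PDF files (scanned images). How can I efficiently extract the text from all of the files in the directory? So far I have tried: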

import multiprocessing
import textract

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))

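However, it doesn't work... it takes a lot of time (some of my files have 600 pages). Also: a) I don't know how to handle the directory conversion part efficiently. b) I would like to add a page separator, say: <start/page = 1> ... page content ... <end/page = 1>, but I don't know how to do that.

So, how can I apply the extract_txt function to every element of a directory that ends with .pdf, return the same files in .txt format in another directory, and add a page separator to the OCR-extracted text?

Also, I would be happy to use Google Docs for this task. Is it possible to solve the text extraction problem above programmatically with Google Docs?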

Update

Regarding the "add a page separator" problem (<start/page = 1> ... page content ... <end/page = 1>): after reading Roland Smith's answer, I tried:

import os

from PyPDF2 import PdfFileWriter, PdfFileReader
import textract


def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =' , i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =' , i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')

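However, I still have a problem with the print() part: instead of printing, it would be more useful to save all of the output to a file. So I tried redirecting the output to a file: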

sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname)  # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()

Any idea how to do the page extraction/separator trick and save everything to a file?

Answer 1

In your code, you extract the text but you don't do anything with it.

Try something like this:

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path

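This writes the text to a file that has the same name but a .txt extension.

It also returns the path of the original file, to let the parent process know that this file is done.

So I would change the mapping code to: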

p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)

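You don't need to pass an argument when creating a Pool. By default it creates as many worker processes as there are CPU cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they become available.
Because the worker function returns the file name, you can print it to let the user know that the file is done.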
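
To apply the same pattern to every PDF in a directory and write the results into a different directory, a minimal sketch might look like the following. The names pair_paths, ocr_to_file and convert_directory, and the directory arguments, are hypothetical helpers made up for illustration; they are not part of textract. textract is imported inside the worker so that the path-planning helper can be used without it.

```python
import multiprocessing
from pathlib import Path


def pair_paths(src_dir, dst_dir):
    """Return (pdf_path, txt_path) string pairs for every .pdf in src_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    return [
        (str(pdf), str(dst / (pdf.stem + '.txt')))
        for pdf in sorted(src.iterdir())
        if pdf.is_file() and pdf.suffix.lower() == '.pdf'
    ]


def ocr_to_file(pair):
    """Worker: OCR one PDF and write the text to its planned .txt path."""
    import textract  # imported here so only the workers need it
    pdf_path, txt_path = pair
    text = textract.process(pdf_path, method='tesseract')  # returns bytes
    with open(txt_path, 'wb') as output_file:
        output_file.write(text)
    return pdf_path


def convert_directory(src_dir, dst_dir):
    """OCR every PDF in src_dir into a .txt file in dst_dir."""
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    with multiprocessing.Pool() as pool:
        pairs = pair_paths(src_dir, dst_dir)
        for done in pool.imap_unordered(ocr_to_file, pairs):
            print('completed file:', done)
```

You would call convert_directory with your own source and target directories from under an `if __name__ == '__main__':` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter.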

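Edit 1:

An additional question was whether it is possible to mark page boundaries. I think so.

A method that will definitely work is to split the PDF file into pages before the OCR. You can find out the number of pages in a document with e.g. pdfinfo from the poppler-utils package. Then you can use e.g. pdfseparate from the same poppler-utils package to convert one PDF file of N pages into N PDF files of one page each. You can then OCR the single-page PDF files separately. That gives you the text of each page separately.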
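
As a rough sketch of that splitting step, the two poppler-utils tools can be driven from Python with subprocess. The function names and the default file name pattern below are my own choices for illustration, and the code assumes that pdfinfo and pdfseparate are installed and on the PATH.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the page count from the text that pdfinfo prints."""
    match = re.search(r'^Pages:\s+(\d+)', pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError('no "Pages:" line in pdfinfo output')
    return int(match.group(1))


def count_pages(pdf_path):
    """Run pdfinfo on pdf_path and return the number of pages."""
    result = subprocess.run(['pdfinfo', pdf_path], check=True,
                            capture_output=True, text=True)
    return parse_page_count(result.stdout)


def split_pages(pdf_path, pattern='page-%03d.pdf'):
    """Run pdfseparate to write one single-page PDF per page.

    pattern must contain a %d-style placeholder, which pdfseparate
    replaces with the page number.
    """
    subprocess.run(['pdfseparate', pdf_path, pattern], check=True)
```

pdfseparate numbers pages starting from 1, so with the pattern above a three-page document would produce page-001.pdf through page-003.pdf, each of which can then be passed to the OCR step on its own.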

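Alternatively, you can OCR the whole document and then search for page breaks. This only works if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the method above.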
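
To illustrate the second approach, here is a small sketch that assumes every page ends with a footer line of the form "Page N of M". That footer format is purely an assumption for the example; you would have to adapt the regular expression to whatever your documents actually contain, and OCR errors in the footer would make it miss page breaks.

```python
import re


def split_on_footer(text, footer_pattern=r'^[ \t]*Page \d+ of \d+[ \t]*$'):
    """Split OCR text into pages wherever a line matches the footer."""
    chunks = re.split(footer_pattern, text, flags=re.MULTILINE)
    # Drop chunks that are only whitespace, e.g. after the last footer.
    return [chunk.strip('\n') for chunk in chunks if chunk.strip()]


def mark_pages(pages):
    """Wrap each page in begin/end markers and join them into one string."""
    template = '<begin page pos = {0}>\n{1}\n<end page pos = {0}>\n'
    return ''.join(template.format(i, page) for i, page in enumerate(pages))
```

The footer lines themselves are discarded by the split; only the text between them is kept.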

Edit 2:

If you need a file, write a file:

import os

from PyPDF2 import PdfFileWriter, PdfFileReader
import textract


def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    outfname = pdf_file[:-4] + '.txt'  # Assuming PDF file name ends with ".pdf"
    with open(outfname, 'w', encoding='utf-8') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            # Use a separate name so the .txt file name is not overwritten.
            pagefname = 'page{:03d}.pdf'.format(i)
            with open(pagefname, 'wb') as outfile:  # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # textract returns bytes; decode before joining it with str.
            text = textract.process(pagefname, method='tesseract').decode('utf-8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(pagefname)  # clean up.
            print(text)
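
The attempt in the question loses output because every open("test.txt", "w") truncates the file again. A minimal sketch of the alternative is to open the file once and hand it to print() through its file argument. The function name write_marked_pages and the idea of passing in already-decoded page strings are assumptions made for this example.

```python
def write_marked_pages(page_texts, out_path):
    """Write each page's text, wrapped in begin/end markers, to one file."""
    # Open once: reopening with mode 'w' for each write would truncate the file.
    with open(out_path, 'w', encoding='utf-8') as textfile:
        for i, page_text in enumerate(page_texts):
            print('<begin page pos = {}>'.format(i), file=textfile)
            print(page_text, file=textfile)
            print('<end page pos = {}>'.format(i), file=textfile)
```

This leaves sys.stdout untouched, so progress messages can still be printed to the terminal while the OCR text goes to the file.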

How can I efficiently extract text from a directory of PDF files using OCR?
