Question

I am an absolute beginner. I feel my way through code by analogy with examples, so apologies for any misuse of terminology.

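I have written a small piece of code in Python 3 that:

Accepts user input (a folder on the computer)
Searches the folder for PDF files
Converts each page of each PDF into a sequentially numbered image
Iterates over the JPGs in numerical order, turning them black and white
OCRs the files and outputs the text into an object
Saves the text content to a .txt file (via pytesseract)
Deletes the JPGs, leaving the .txt file

Most of the time is spent converting to JPGs and possibly turning them black and white.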

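The code works, but I am sure it can be improved. It takes a while to run, so I thought I would try multiprocessing with Pools.

My code seems to create the pool. I can also get the function to print the list of files in the folder, so the list appears to be passed to it in one form or another.

I cannot get it to work, and I have now hacked at the code repeatedly, producing various errors. I think the main problem is that I am out of my depth.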

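My code is structured as follows:

A user input block (asks the user for a folder in their directory, checks that it is a valid folder, etc.).

An OCR block as a function (parses a PDF and outputs the contents into a single .txt file).

A for-loop block as a function (supposed to iterate over every PDF in the folder and run the OCR block on it).

A multiprocessing block (supposed to feed the list of files in the directory to the loop block).

To avoid writing War and Peace, I have listed the latest versions of the loop block and the multiprocessing block below: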

    # import necessary modules

    home_path = os.path.expanduser('~')

    # ask for input with various checking mechanisms to make sure a useful pdfDir is obtained
    pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

    def textExtractor():
        # convert pdf to jpeg with a tesseract friendly resolution
        with Img(filename=pdf_filename, resolution=300) as img:  # some can be encrypted so use OCR instead of other libraries
            # various lines of code here
        compilation_temp.close()

    def per_file_process(subject_files):
        for pdf in subject_files:
            # decode the whole file name as a string
            pdf_filename = os.fsdecode(pdf)

            # check whether the string ends in .pdf
            if pdf_filename.endswith(".pdf"):
                # call the OCR function on it
                textExtractor()
            else:
                print('nonsense')

    if __name__ == '__main__':
        pool = Pool(2)
        pool.map(per_file_process, os.listdir(pdfDir))

Would anyone be willing/able to point out my mistakes?

The relevant bits of the code from when it was working:

    # import necessary

    home_path = os.path.expanduser('~')

    # block accepting input
    pdfDir = home_path + '/Documents/' + input('Please input the folder name where the PDFs are stored. The folder must be directly under the Documents folder. It cannot have a space in it. \n \n Name of folder:')

    def textExtractor():
        # convert pdf to jpeg with a tesseract friendly resolution
        with Img(filename=pdf_filename, resolution=300) as img:  # need to think about using generic expanduser or other libraries to allow portability
            # various lines of code to OCR and output .txt file
        compilation_temp.close()

    subject_files = os.listdir(pdfDir)
    for pdf in subject_files:
        # decode the whole file name as a string you can see
        pdf_filename = os.fsdecode(pdf)
        # check whether the string ends in .pdf
        if pdf_filename.endswith(".pdf"):
            textExtractor()
        else:
            # print for debugging

Answer 1

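Pool.map calls the worker function repeatedly, once with each name returned by os.listdir. Inside per_file_process, subject_files is therefore a single file name, and for pdf in subject_files: iterates over the individual characters of that name. Furthermore, listdir returns only base names, without the directory part, so you are not looking for the PDFs in the right place. You can use glob to filter by extension and get back working paths to the files.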
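
To make both points concrete, here is a minimal sketch. It is not your code: describe is a made-up worker, the folder is a temporary one containing a single dummy file, and it uses multiprocessing.dummy (a thread-backed Pool with the same map API) only so the snippet runs as a plain script without the `__main__` guard.

```python
import os
import tempfile
from glob import glob
from multiprocessing.dummy import Pool  # thread-backed Pool, same map() API


def describe(item):
    # Pool.map passes ONE element of the iterable per call, so "item" is a
    # single file name; iterating over a string yields its characters.
    return list(item)


with tempfile.TemporaryDirectory() as folder:
    # Create one empty placeholder file called a.pdf.
    open(os.path.join(folder, "a.pdf"), "w").close()

    with Pool(2) as pool:
        per_call = pool.map(describe, os.listdir(folder))

    base_names = os.listdir(folder)                   # ['a.pdf'], no directory part
    full_paths = glob(os.path.join(folder, "*.pdf"))  # folder joined with 'a.pdf'

print(per_call)    # [['a', '.', 'p', 'd', 'f']]
print(base_names)  # ['a.pdf']
```

The worker received the string 'a.pdf' and looped over its letters, which is what happens inside per_file_process. The glob result, unlike the listdir result, can be opened from any working directory.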

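Your example is confusing... textExtractor() takes no parameters, so how does it know which file it is processing? I'm going out on a limb and assuming that it really takes the path of the file to process. If so, parallelizing is easy: just feed the PDF paths to it through map. Assuming processing time varies by PDF, I set chunksize to 1 so that workers that finish early can grab additional files to process.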

    from glob import glob
    import os
    from multiprocessing import Pool

    # Img is assumed to be imported as in the question's elided import block
    def textExtractor(pdf_filename):
        # convert pdf to jpeg with a tesseract friendly resolution
        with Img(filename=pdf_filename, resolution=300) as img:  # some can be encrypted so use OCR instead of other libraries
            # ...various lines of code here
        compilation_temp.close()

    if __name__ == '__main__':
        # pdfDir is the folder inputted by user
        with Pool(2) as pool:
            # assuming call signature: textExtractor(path_to_file)
            pool.map(textExtractor,
                     (filename for filename in glob(os.path.join(pdfDir, '*.pdf'))
                      if os.path.isfile(filename)),
                     chunksize=1)
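
If you want to try the pool wiring before plugging the OCR back in, the following is a runnable sketch of the same pattern with a placeholder worker. Everything in it is an assumption for illustration: find_pdfs, txt_path_for, process_one and run are hypothetical names, the report.pdf to report.txt naming rule is invented, and the "scans" folder is a stand-in for whatever the user types.

```python
import os
from glob import glob
from multiprocessing import Pool


def find_pdfs(folder):
    """Return the sorted full paths of regular files in folder ending in .pdf."""
    pattern = os.path.join(folder, "*.pdf")
    return sorted(p for p in glob(pattern) if os.path.isfile(p))


def txt_path_for(pdf_path):
    """Hypothetical naming rule: report.pdf -> report.txt in the same folder."""
    return os.path.splitext(pdf_path)[0] + ".txt"


def process_one(pdf_path):
    """Placeholder worker: the real convert-and-OCR steps would go here."""
    return pdf_path, txt_path_for(pdf_path)


def run(folder, workers=2):
    """Process every PDF in folder, one file per task."""
    with Pool(workers) as pool:
        # chunksize=1 hands out one file at a time, so a worker that
        # finishes early picks up the next file instead of sitting idle.
        return pool.map(process_one, find_pdfs(folder), chunksize=1)


if __name__ == "__main__":
    # "scans" is a made-up folder name under Documents.
    target = os.path.join(os.path.expanduser("~"), "Documents", "scans")
    for pdf_path, txt_path in run(target):
        print(pdf_path, "->", txt_path)
```

The worker takes exactly one argument (the full path) and returns plain values, which is all Pool.map needs. Swapping the body of process_one for the real OCR code should not require touching run.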
