Question

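I am relatively new to Python and am trying to use the multiprocessing module to parallelize my for loop.

I have a list of image URLs stored in img_urls, which I need to download and run through Google Vision.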

This is how I call my runAll() method:

import time

if __name__ == '__main__':
    start_time = time.time()  # start of the overall timing

    img_urls = [ALL_MY_Image_URLS]
    runAll(img_urls)
    print("--- %s seconds ---" % (time.time() - start_time))

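When I run it, Python crashes and I get a warning (may have been in progress in another thread when fork() was called). Here are runAll() and annotate():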

import multiprocessing
import os
import time

import requests

# vision_client, types, report and full_matching_pages are defined
# elsewhere in the script (not shown in the question).


def runAll(img_urls):
    print("Image URLs: {}".format(len(img_urls)))

    start_timeProcess = time.time()

    # Pool() starts one worker process per CPU core by default.
    pool = multiprocessing.Pool()
    pool.map(annotate, img_urls)
    end_timeProcess = time.time()
    print('\n Time to complete ', end_timeProcess - start_timeProcess)

    print(full_matching_pages)


def annotate(img_path):
    """Returns web annotations given the path to an image."""
    file = requests.get(img_path).content
    print("file is", file)
    print('Process Working under ', os.getpid())
    image = types.Image(content=file)
    web_detection = vision_client.web_detection(image=image).web_detection
    report(web_detection)

Answer 1

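This error occurs because of added security in macOS High Sierra that restricts multithreading around fork(). I know this answer is a bit late, but I solved the problem as follows:

Set an environment variable in .bash_profile to allow multithreaded applications or scripts under the new macOS High Sierra security rules.

Open a terminal: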

$ nano .bash_profile

Add the following line to the end of the file:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Save, exit, close the terminal, and reopen it. Check that the environment variable is now set:

$ env

You will see output similar to the following:

TERM_PROGRAM=Apple_Terminal
SHELL=/bin/bash
TERM=xterm-256color
TMPDIR=/var/folders/pn/vasdlj3ojO#OOas4dasdffJq/T/
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.E7qLFJDSo/Render
TERM_PROGRAM_VERSION=404
TERM_SESSION_ID=NONE
OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

You should now be able to run your Python script with multiprocessing.
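To confirm from inside Python that the variable actually reached the interpreter's environment, a small check like the following can help. This is only a sketch; the helper name fork_safety_disabled is hypothetical and not part of the original answer.

```python
import os

FLAG = "OBJC_DISABLE_INITIALIZE_FORK_SAFETY"


def fork_safety_disabled(environ=None):
    """Return True if the flag is set to YES in the given environment mapping."""
    # Default to the real process environment when no mapping is passed in.
    env = os.environ if environ is None else environ
    return env.get(FLAG) == "YES"


if __name__ == "__main__":
    print(FLAG, "is set:", fork_safety_disabled())
```

If this prints False in a freshly opened terminal, the export line was most likely not picked up by the shell that launched Python.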

Answer 2

A solution that works for me without the OBJC_DISABLE_INITIALIZE_FORK_SAFETY flag in the environment is to initialize the multiprocessing.Pool class right after the main() program starts.

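This is most likely not the fastest possible solution, and I am not sure it works in all situations. However, warming up the worker processes early enough, before the rest of my program starts, does not produce any "... may have been in progress in another thread when fork() was called" errors, and I do get a significant performance boost compared to non-parallelized code.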
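A minimal sketch of this idea with a plain multiprocessing.Pool, assuming the pool is the first thing main() creates. The task function work and the pool size of 2 are placeholders chosen for illustration, not part of the original answer. Whether this avoids the crash depends on what the program has already imported or started before the workers are forked.

```python
import multiprocessing


def work(n):
    """Placeholder task run in a worker process: return the input squared."""
    return n * n


def main():
    # Create the pool first, before any other setup (network clients or
    # libraries that start background threads) has had a chance to run.
    pool = multiprocessing.Pool(processes=2)
    try:
        # ... the rest of the program's setup would go here ...
        results = pool.map(work, range(5))
        print(results)
    finally:
        # Shut the workers down cleanly once the program is done with them.
        pool.close()
        pool.join()


if __name__ == "__main__":
    main()
```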

I created a convenience class, Parallelizer, which I start very early and then use throughout the lifetime of my program.

# entry point to my program
def main():
    parallelizer = Parallelizer()
    ...

Then, whenever you want to parallelize something:

# this function is parallelized. it is run by each child process.
def processing_function(input):
    ...
    return output

...
inputs = [...]
results = parallelizer.map(
    inputs,
    processing_function
)

And the Parallelizer class:

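import multiprocessing


class Parallelizer:
    def __init__(self):
        self.input_queue = multiprocessing.Queue()
        self.output_queue = multiprocessing.Queue()
        # Each worker runs _run as its initializer and loops there forever,
        # pulling work from input_queue; the pool itself is never given tasks.
        self.pool = multiprocessing.Pool(multiprocessing.cpu_count(),
                                         Parallelizer._run,
                                         (self.input_queue, self.output_queue,))

    def map(self, contents, processing_func):
        # processing_func is sent through a queue, so it must be picklable
        # (for example, a module-level function).
        size = 0
        for content in contents:
            self.input_queue.put((content, processing_func))
            size += 1
        # Collect one result per input. Results arrive in completion order,
        # which may differ from the input order.
        results = []
        while size > 0:
            result = self.output_queue.get(block=True)
            results.append(result)
            size -= 1
        return results

    @staticmethod
    def _run(input_queue, output_queue):
        # Worker loop: take a (content, function) pair, run it, send the result back.
        while True:
            content, processing_func = input_queue.get(block=True)
            result = processing_func(content)
            output_queue.put(result)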

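One caveat: parallelized code can be difficult to debug, so I have also prepared a non-parallelizing version of my class, which I enable when something goes wrong in the child processes: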

class NullParallelizer:
    @staticmethod
    def map(contents, processing_func):
        results = []
        for content in contents:
            results.append(processing_func(content))
        return results
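One way to switch between the two classes is a small factory driven by an environment variable. This is only a sketch: make_parallelizer and the variable name DEBUG_NO_PARALLEL are hypothetical, and parallel_factory stands in for the Parallelizer class above so that the block stays self-contained.

```python
import os


class NullParallelizer:
    """Sequential stand-in with the same map() interface, as in the answer."""

    @staticmethod
    def map(contents, processing_func):
        return [processing_func(content) for content in contents]


def make_parallelizer(parallel_factory, environ=None):
    """Return a NullParallelizer when debugging is requested, else call the factory."""
    env = os.environ if environ is None else environ
    # DEBUG_NO_PARALLEL is a made-up variable name used only for this sketch.
    if env.get("DEBUG_NO_PARALLEL") == "1":
        return NullParallelizer()
    return parallel_factory()
```

In main() this would be called as make_parallelizer(Parallelizer), still as early as possible, so the debugging switch does not change when the worker processes are created.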

Multiprocessing causes Python to crash and gives an error: may have been in progress in another thread when fork() was called

2 Answers: