Using multiprocessing in a Python script

Time: 2020-07-22 17:29:53

Tags: python multiprocessing

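I am trying to label a large number of images, organized as brand -> product -> images for each product. Labeling the images one at a time takes a while, so I decided to use multiprocessing to speed things up. Multiprocessing does make the labeling faster, but the code does not behave the way I expect.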

Code:

import json
import multiprocessing


def multiprocessing_func(line):
    json_line = json.loads(line)
    product = json_line['groupid']
    active_urls = set(json_line['urls'])

    try:
        active_urls.remove(brand_dic[brand])
    except:
        pass

    if product in saved_product_dict and active_urls == saved_product_dict[product]:
        keep_products.append(product)
        print('True')
    else:
        with open(new_images_filename, 'a') as save_file:
            labels = label_product_images(line)
            save_file.write('{}\n'.format(json.dumps(labels)))
        print('False')


active_images_filename = 'data/input/image_urls.json'
new_images_filename = 'data/output/new_labeled_images.json'
saved_images_filename = 'data/output/saved_labeled_images.json'

brand_dic = {'a': 'https://www.a.com/imgs/ab/images/dp/m.jpg',
             'b': 'https://www.b.com/imgs/ab/images/wcm/m.jpg',
             'c': 'https://www.c.com/imgs/ab/images/dp/m.jpg',}

if __name__ == '__main__':
    brands = ['a', 'b', 'c']
    for brand in brands:
        active_images_filename = 'data/input/brands/' + brand + '/image_urls.json'
        new_images_filename = 'data/output/brands/' + brand + '/new_labeled_images.json'
        saved_images_filename = 'data/output/brands/' + brand + '/saved_labeled_images.json'

        print(new_images_filename)
        with open(new_images_filename, 'w'): pass


        saved_product_dict = {}
        with open(saved_images_filename) as in_file:
            for line in in_file:
                json_line = json.loads(line)
                saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
                saved_product_dict[json_line['groupid']] = set(saved_urls)


        print(saved_product_dict)
        keep_products = []
        labels_list = []
        with open(active_images_filename, 'r') as in_file:
            processes = []
            for line in in_file:
                p = multiprocessing.Process(target=multiprocessing_func, args=(line,))
                processes.append(p)
                p.start()

        print('complete stage 1')

    for i in range(0,2):
        print('running stage 2')

Output:

data/output/brands/mg/new_labeled_images.json
{}
complete stage 1
running stage 2
running stage 2
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202010/0027/anchor-hope-and-protect-necklace-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202007/0003/patterned-folded-notecards-set-of-25-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202005/0003/patterned-folded-notecards-set-of-25-t.jpg
silo : https://a/mgimgs/rk/images/dp/wcm/202007/0002/patterned-folded-notecards-set-of-25-1-m.jpg
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0013.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0002.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0003.jpg
False
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0022.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202019/454.jpg
False
lifestyle - Lif1 : https://a.com/mgimgs/rk/images/dp/wcm/202025/0011.jpg
False
False

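I noticed that the multiprocessing step finishes last, and the code after it runs without waiting for it. I am not sure why that happens. I am also not sure why the first part does not seem to run: when I print `saved_product_dict`, the dictionary is empty.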

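I have code that runs both before and after the multiprocessing step. My question is how to force the multiprocessing step to run in the order the code is written. Any explanation would be much appreciated. I am new to multiprocessing and still learning how it works.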

1 answer:

Answer 0 (score: 1)

This line looks wrong. Try changing

saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]

to:

saved_urls = [url for url in json_line['urls']]

This may solve the first part of the problem.
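To see why the shape of `urls` matters, here is a small sketch comparing the two comprehensions. The two JSON records are made up for illustration; the real schema of `saved_labeled_images.json` is not shown in the question, so which variant is correct depends on whether `urls` is a list of lists or a flat list of strings.

```python
import json

# Hypothetical records: the actual file layout is an assumption.
nested = json.loads('{"groupid": "p1", "urls": [["u1", "u2"], ["u3"]]}')
flat = json.loads('{"groupid": "p2", "urls": ["u1", "u2"]}')

# The comprehension from the question flattens a list of lists.
nested_urls = [url for urls_list in nested['urls'] for url in urls_list]

# Applied to a flat list of strings, the same comprehension iterates
# over the characters of each string instead of over URLs.
chars = [url for urls_list in flat['urls'] for url in urls_list]

# The single-level comprehension keeps a flat list of strings intact.
flat_urls = [url for url in flat['urls']]

print(nested_urls)
print(chars)
print(flat_urls)
```

If `urls` is already a flat list, `set(json_line['urls'])` gives the same result without a comprehension.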

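As for the multiprocessing part and the prints from the main program: `p.start()` returns immediately, so the main process carries on to the next statements while the child processes are still working. In an asynchronous setting like this one (separate processes), the order of the printed output does not always reflect the order in which the functions or parts of the script actually ran. If you want the script to run in a defined order, you need either a synchronization mechanism built on semaphores and mutexes, or to wait for all processes to exit before moving on to stage 2. I think the missing wait is the main problem here.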
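A minimal sketch of the "wait for all processes to exit" option, using `Process.join()`. The `label_line` function and the sample lines are hypothetical stand-ins for the labeling work in the question, not the asker's actual code.

```python
import multiprocessing
import time


def label_line(line):
    # Hypothetical stand-in for the slow per-line labeling work.
    time.sleep(0.1)
    print('labeled', line.strip())


def run_stage_1(lines):
    processes = []
    for line in lines:
        p = multiprocessing.Process(target=label_line, args=(line,))
        processes.append(p)
        p.start()  # returns immediately; the child runs in parallel

    # Block until every child has exited before returning.
    for p in processes:
        p.join()

    # After join(), exitcode is 0 for each child that finished normally.
    return [p.exitcode for p in processes]


if __name__ == '__main__':
    exit_codes = run_stage_1(['a\n', 'b\n', 'c\n'])
    print('complete stage 1', exit_codes)
    for i in range(2):
        print('running stage 2')
```

With the `join()` loop in place, "complete stage 1" is printed only after all children have finished. Note also that each child process works on its own copy of the parent's variables, so a list such as `keep_products` appended to inside a child is not updated in the parent.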
