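I am trying to label multiple images by brand -> product -> images for each product. Since labeling each image one at a time takes a while, I decided to use multiprocessing to speed up the job. Multiprocessing definitely speeds up the labeling, but the code does not work as I expected.
Code: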
import json
import multiprocessing


def multiprocessing_func(line):
    # label_product_images is defined elsewhere (not shown in the question)
    json_line = json.loads(line)
    product = json_line['groupid']
    active_urls = set(json_line['urls'])
    try:
        active_urls.remove(brand_dic[brand])
    except:
        pass
    if product in saved_product_dict and active_urls == saved_product_dict[product]:
        keep_products.append(product)
        print('True')
    else:
        with open(new_images_filename, 'a') as save_file:
            labels = label_product_images(line)
            save_file.write('{}\n'.format(json.dumps(labels)))
        print('False')


active_images_filename = 'data/input/image_urls.json'
new_images_filename = 'data/output/new_labeled_images.json'
saved_images_filename = 'data/output/saved_labeled_images.json'
brand_dic = {'a': 'https://www.a.com/imgs/ab/images/dp/m.jpg',
             'b': 'https://www.b.com/imgs/ab/images/wcm/m.jpg',
             'c': 'https://www.c.com/imgs/ab/images/dp/m.jpg',}

if __name__ == '__main__':
    brands = ['a', 'b', 'c']
    for brand in brands:
        active_images_filename = 'data/input/brands/' + brand + '/image_urls.json'
        new_images_filename = 'data/output/brands/' + brand + '/new_labeled_images.json'
        saved_images_filename = 'data/output/brands/' + brand + '/saved_labeled_images.json'
        print(new_images_filename)
        # Truncate the output file for this brand
        with open(new_images_filename, 'w'): pass
        saved_product_dict = {}
        with open(saved_images_filename) as in_file:
            for line in in_file:
                json_line = json.loads(line)
                saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
                saved_product_dict[json_line['groupid']] = set(saved_urls)
        print(saved_product_dict)
        keep_products = []
        labels_list = []
        with open(active_images_filename, 'r') as in_file:
            processes = []
            for line in in_file:
                p = multiprocessing.Process(target=multiprocessing_func, args=(line,))
                processes.append(p)
                p.start()
        print('complete stage 1')
        for i in range(0,2):
            print('running stage 2')
Output:
data/output/brands/mg/new_labeled_images.json
{}
complete stage 1
running stage 2
running stage 2
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202025/0011/terminal-1-soft-sided-carry-on-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202010/0027/anchor-hope-and-protect-necklace-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202007/0003/patterned-folded-notecards-set-of-25-m.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202005/0003/patterned-folded-notecards-set-of-25-t.jpg
silo : https://a/mgimgs/rk/images/dp/wcm/202007/0002/patterned-folded-notecards-set-of-25-1-m.jpg
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0013.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0002.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/a/202007/0003.jpg
False
unmatched : https://www.a.com/mgimgs/rk/images/dp/a/202010/0022.jpg
silo : https://www.a.com/mgimgs/rk/images/dp/wcm/202019/454.jpg
False
lifestyle - Lif1 : https://a.com/mgimgs/rk/images/dp/wcm/202025/0011.jpg
False
False
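I noticed that the multiprocessing step is executed last and the rest of the code runs past it, and I am not sure why it does that. I am also not sure why the first part does not work: when I try to print "saved_product_dict", the dictionary is empty.
I have code that runs both before and after the multiprocessing step. My question is how to force the multiprocessing step to run in the order in which the code is written. Any explanation would be much appreciated. I am new to multiprocessing and still learning how it works.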
Answer 0 (score: 1)
This line seems to be wrong. Try changing
saved_urls = [url for urls_list in json_line['urls'] for url in urls_list]
to:
saved_urls = [url for url in json_line['urls']]
This may be the solution to the first part of the question.
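To see why the shape of 'urls' matters here, the following is a minimal sketch. The two records and the helper names flatten_two_levels and copy_one_level are hypothetical, since the question does not show what the saved file actually contains. It shows that the nested comprehension is only appropriate when 'urls' is a list of lists; on a flat list of strings it iterates over the characters of each URL.

```python
# Hypothetical records: the real layout of 'urls' in the saved file is an assumption.
flat_record = {'groupid': 'p1',
               'urls': ['https://example.com/a.jpg', 'https://example.com/b.jpg']}
nested_record = {'groupid': 'p1',
                 'urls': [['https://example.com/a.jpg'], ['https://example.com/b.jpg']]}


def flatten_two_levels(record):
    # The comprehension from the question: expects a list of lists of URLs.
    return [url for urls_list in record['urls'] for url in urls_list]


def copy_one_level(record):
    # The corrected comprehension: expects a flat list of URLs.
    return [url for url in record['urls']]


if __name__ == '__main__':
    print(flatten_two_levels(nested_record))    # two whole URLs
    print(copy_one_level(flat_record))          # two whole URLs
    print(flatten_two_levels(flat_record)[:5])  # single characters, not URLs
```

If the saved set ends up holding single characters, it can never equal the set of active URLs, so every product would take the 'False' branch.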
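As for the multiprocessing part and the prints from the program's main thread: in an asynchronous environment (here there are separate processes), the order of the printed output does not always correctly indicate when a function or script actually ran. If you want the script to run in a defined order, you need to implement a synchronization mechanism using semaphores and mutexes, or wait for all processes to exit before moving on to stage 2, which I think is the main issue.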
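The "wait for all processes to exit" option can be sketched as follows. This is only an illustration under assumptions: label_line is a hypothetical stand-in for the labeling function in the question, and run_stage_1 is a name introduced here. The key call is Process.join(), which blocks the parent until the child has finished.

```python
import multiprocessing


def label_line(line):
    # Hypothetical stand-in for the per-line labeling work in the question.
    print('labeled', line.strip())


def run_stage_1(lines):
    """Start one process per line, then block until every one has exited."""
    processes = []
    for line in lines:
        p = multiprocessing.Process(target=label_line, args=(line,))
        processes.append(p)
        p.start()
    # join() waits for a child to finish; without it the parent carries on
    # immediately while the children are still working.
    for p in processes:
        p.join()
    return [p.exitcode for p in processes]


if __name__ == '__main__':
    exit_codes = run_stage_1(['first\n', 'second\n'])
    print('complete stage 1', exit_codes)
    print('running stage 2')
```

With the joins in place, 'complete stage 1' is printed only after every child has exited. Note also that each child process has its own memory, so a list that a child appends to (such as keep_products in the question) is not changed in the parent; results would have to be sent back explicitly, for example through a multiprocessing.Queue or the return values of multiprocessing.Pool.map.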