I'm having trouble with the Python multiprocessing module.
In short, I have a dictionary that is updated with the number of occurrences of certain strings found in many S3 files. Each dictionary key is an occurrence I'm searching for, and its value is incremented by 1 every time that string is found.
Sample code:
import boto3
from multiprocessing import Process, Manager
import simplejson
client = boto3.client('s3')
occurences_to_find = ["occ1", "occ2", "occ3"]
list_contents = []
def getS3Key(prefix_name, occurence_dict):
    kwargs = {'Bucket': "bucket_name", 'Prefix': prefix_name}
    while True:
        value = client.list_objects_v2(**kwargs)
        try:
            contents = value['Contents']
            for obj in contents:
                key = obj['Key']
                yield key
            # move on to the next page of results, if there is one
            try:
                kwargs['ContinuationToken'] = value['NextContinuationToken']
            except KeyError:
                break
        except KeyError:
            break

def getS3Object(s3_key, occurence_dict):
    response = client.get_object(Bucket="bucket_name", Key=s3_key)
    lines = response['Body'].read().decode('utf-8').splitlines()
    for line in lines:
        line_json = simplejson.loads(line)
        msg = line_json["msg"]
        for occurence in occurence_dict:
            if occurence in msg:
                occurence_dict[str(occurence)] += 1
                break

# each process will hit this function
def doWork(prefix_name_list, occurence_dict):
    for prefix_name in prefix_name_list:
        for s3_key in getS3Key(prefix_name, occurence_dict):
            getS3Object(s3_key, occurence_dict)

def main():
    manager = Manager()
    # shared dictionary between processes
    occurence_dict = manager.dict()
    procs = []
    s3_prefixes = [["prefix1"], ["prefix2"], ["prefix3"], ["prefix4"]]
    for occurrence in occurences_to_find:
        occurence_dict[occurrence] = 0
    for prefix_name_list in s3_prefixes:
        proc = Process(target=doWork, args=(prefix_name_list, occurence_dict))
        procs.append(proc)
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    print(occurence_dict)

if __name__ == '__main__':
    main()
I'm running into a speed problem: with more than 10,000 S3 prefixes and keys the code takes several hours to run. I think the Manager dictionary, being shared, is locked by each process, so it is not updated concurrently; instead, each process waits for the lock to be released.
How can I update the dictionary in parallel? Alternatively, how can I keep a separate dict per process and then merge the results together at the end?
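Here is a minimal sketch of the second idea (one local counter per worker, merged in the parent), using multiprocessing.Pool and collections.Counter instead of a Manager dict. It keeps the same assumptions as the code above: the bucket is literally named "bucket_name" and every S3 object is newline-delimited JSON with a "msg" field; list_keys and count_prefix are helper names made up for this sketch.

import boto3
import simplejson
from collections import Counter
from multiprocessing import Pool

occurences_to_find = ["occ1", "occ2", "occ3"]

def list_keys(client, prefix_name):
    # paginate through every key under one prefix
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket="bucket_name", Prefix=prefix_name):
        for obj in page.get('Contents', []):
            yield obj['Key']

def count_prefix(prefix_name):
    # runs inside a worker process: builds a purely local Counter,
    # so there is no shared state and no Manager lock to wait on
    client = boto3.client('s3')  # create the client per process
    counts = Counter({occ: 0 for occ in occurences_to_find})
    for key in list_keys(client, prefix_name):
        body = client.get_object(Bucket="bucket_name", Key=key)['Body'].read()
        for line in body.decode('utf-8').splitlines():
            msg = simplejson.loads(line)["msg"]
            for occ in occurences_to_find:
                if occ in msg:
                    counts[occ] += 1
                    break
    return counts

if __name__ == '__main__':
    prefixes = ["prefix1", "prefix2", "prefix3", "prefix4"]
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_prefix, prefixes)
    total = Counter()
    for partial in partial_counts:  # merge the per-process counters once
        total.update(partial)
    print(dict(total))

With this layout each worker only touches its own Counter and the merge happens once at the end, so the cost is dominated by the S3 requests rather than by contention on a shared dict; pool.map also hands out one prefix per task, so with more prefixes than workers the load balances itself.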