Manager dict is very slow when updated by 100+ processes

Time: 2018-03-02 20:09:30

Tags: python amazon-s3 python-multiprocessing

I'm having a hard time with Python's multiprocessing module.

In short, I have a dictionary object that accumulates, for example, the occurrences of certain strings found across many S3 files. The dictionary keys are the occurrences I need to find; each time one is found, its count is incremented by 1.
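The counting pattern described above (keys are the strings to look for; each hit increments a count) can be sketched independently of S3 with a plain dict; `count_in_messages` is a hypothetical helper, not part of the original code:

```python
occurrences_to_find = ["occ1", "occ2", "occ3"]

def count_in_messages(messages, targets):
    # One counter slot per target string, matching the question's dict layout.
    counts = {t: 0 for t in targets}
    for msg in messages:
        for t in targets:
            if t in msg:
                counts[t] += 1
                break  # count at most one target per message, as in the question's code
    return counts

print(count_in_messages(["occ1 here", "occ2 and occ1", "nothing"], occurrences_to_find))
# → {'occ1': 2, 'occ2': 0, 'occ3': 0}
```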

Sample code:

import boto3
from multiprocessing import Process, Manager
import simplejson

client = boto3.client('s3')
occurences_to_find = ["occ1", "occ2", "occ3"]

list_contents = []


def getS3Key(prefix_name, occurence_dict):
    kwargs = {'Bucket': "bucket_name", 'Prefix': prefix_name}
    while True:
        value = client.list_objects_v2(**kwargs)
        for obj in value.get('Contents', []):
            yield obj['Key']
        try:
            kwargs['ContinuationToken'] = value['NextContinuationToken']
        except KeyError:
            break

def getS3Object(s3_key, occurence_dict):
    obj = client.get_object(Bucket="bucket_name", Key=s3_key)
    activities = obj['Body'].read().splitlines()
    for activity in activities:
        activity_json = simplejson.loads(activity)
        msg = activity_json["msg"]
        for occurence in occurence_dict.keys():
            if occurence in msg:
                occurence_dict[occurence] += 1
                break

'''each process will hit this function'''
def doWork(prefix_name_list, occurence_dict):
    for prefix_name in prefix_name_list:
        for s3_key in getS3Key(prefix_name, occurence_dict):
            getS3Object(s3_key, occurence_dict)


def main():
    manager = Manager()
    '''shared dictionary between processes'''
    occurence_dict = manager.dict()
    procs = []
    s3_prefixes = [["prefix1"], ["prefix2"], ["prefix3"], ["prefix4"]]
    for occurrence in occurences_to_find:
        occurence_dict[occurrence] = 0

    for index,prefix_name_list in enumerate(s3_prefixes):
        proc = Process(target=doWork, args=(prefix_name_list, occurence_dict))
        procs.append(proc)

    for proc in procs:
        proc.start()

    for proc in procs:
        proc.join()
    print(occurence_dict)

main()

I'm running into a speed problem: the code takes several hours to run over more than 10,000 S3 prefixes and keys. I think the issue is that the Manager dict is shared and locked by each process, so it is not updated concurrently; instead, each process waits for it to be released.
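A minimal sketch (not the original code) of why the shared dict hurts: each `d[k] += 1` on a Manager dict proxy is a read followed by a write, each a round-trip to the manager process over a pipe, and the two steps together are not atomic, so concurrent increments can also be lost:

```python
from multiprocessing import Manager, Process

def bump(d, n):
    # Each += is a __getitem__ then a __setitem__ on the proxy --
    # two round-trips to the manager process, and not atomic.
    for _ in range(n):
        d["k"] += 1

def run_counters(n_procs, n_incr):
    with Manager() as m:
        d = m.dict(k=0)
        procs = [Process(target=bump, args=(d, n_incr)) for _ in range(n_procs)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return d["k"]

if __name__ == "__main__":
    # Often prints less than 4000: concurrent read-modify-write
    # cycles interleave and increments get lost.
    print(run_counters(4, 1000))
```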

How can I update the dictionary in parallel? Alternatively, how can I maintain a separate dict for each process and then merge the results together?
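The second option asked about can be sketched with a `Pool`: each worker counts into its own local `Counter` (no shared state, no proxy traffic) and only the small per-prefix results cross process boundaries. `fetch_messages` below is a stand-in for the S3 reading in `getS3Key`/`getS3Object`, purely for illustration:

```python
from collections import Counter
from multiprocessing import Pool

OCCURRENCES = ["occ1", "occ2", "occ3"]

def fetch_messages(prefix_name):
    # Hypothetical stand-in for listing keys under a prefix and reading
    # the "msg" field out of each S3 object.
    return [prefix_name + " saw occ1", "nothing relevant", "then occ2 appeared"]

def count_prefix(prefix_name):
    # Each worker counts into its own local Counter: no Manager proxy,
    # no lock contention.
    counts = Counter()
    for msg in fetch_messages(prefix_name):
        for occ in OCCURRENCES:
            if occ in msg:
                counts[occ] += 1
                break
    return counts

def count_all(prefixes):
    # The only cross-process communication is returning the small
    # per-prefix Counters, merged once at the end.
    with Pool() as pool:
        partials = pool.map(count_prefix, prefixes)
    return sum(partials, Counter())

if __name__ == "__main__":
    print(count_all(["prefix1", "prefix2", "prefix3", "prefix4"]))
```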

0 Answers:

No answers yet.