I have a list of objects that I need to unpack into a dictionary efficiently. There are more than 2,000,000 objects in the list, and the operation takes over 1.5 hours to complete. I would like to know whether this can be done more efficiently. The objects in the list are based on this class:
class ResObj:
    def __init__(self, index, result):
        self.loc = index    # The location where the values should go in the final result dictionary
        self.res = result   # A dictionary that holds the values for this location

# Example attribute values for one instance:
# self.loc = 2
# self.res = {'value1': 5.4, 'value2': 2.3,
#             'valuen': {'sub_value1': 4.5, 'sub_value2': 3.4, 'sub_value3': 7.6}}
Currently I do this with the following method:
def make_final_result(list_of_results):
    no_sub_result_variables = ['value1', 'value2']
    sub_result_variables = ['valuen']
    sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']
    final_result = {}
    num_of_results = len(list_of_results)
    for var in no_sub_result_variables:
        final_result[var] = numpy.zeros(num_of_results)
    for var in sub_result_variables:
        final_result[var] = {sub_var: numpy.zeros(num_of_results) for sub_var in sub_value_variables}

    for obj in list_of_results:
        i = obj.loc
        result = obj.res
        for var in no_sub_result_variables:
            final_result[var][i] = result[var]
        for var in sub_result_variables:
            for name in sub_value_variables:
                try:
                    final_result[var][name][i] = result[var][name]
                except KeyError:
                    # TODO: add a debug check
                    pass
I tried to parallelize this with multiprocessing.Manager().dict and Manager().Array(), but I could only ever get 2 processes working (even though I manually set the number of processes to the number of CPUs, 24). Can you help me find a faster way to do this? Thank you.
Answer 0 (score: 2)
Nested numpy arrays do not seem to be the best way to structure your data. You can use numpy's structured arrays to create a more intuitive data structure.
Generating the data the following way created 2,000,000-long arrays in 2 seconds on my machine:
import numpy as np
# example values
values = [
    {
        "v1": 0,
        "v2": 1,
        "vs": {
            "x": 2,
            "y": 3,
            "z": 4,
        }
    },
    {
        "v1": 5,
        "v2": 6,
        "vs": {
            "x": 7,
            "y": 8,
            "z": 9,
        }
    }
]

def value_to_record(value):
    """Take a dictionary and convert it to an array-like format"""
    return (
        value["v1"],
        value["v2"],
        (
            value["vs"]["x"],
            value["vs"]["y"],
            value["vs"]["z"]
        )
    )

# define what a record looks like -- f8 is an 8-byte float
dtype = [
    ("v1", "f8"),
    ("v2", "f8"),
    ("vs", [
        ("x", "f8"),
        ("y", "f8"),
        ("z", "f8")
    ])
]

# create the actual array
arr = np.fromiter(map(value_to_record, values), dtype=dtype, count=len(values))

# access an individual record
print(arr[0])  # prints (0.0, 1.0, (2.0, 3.0, 4.0))
# access a specific value
assert arr[0]['vs']['x'] == 2
# access all values of a specific field
print(arr['v2'])  # prints [ 1.  6.]
assert arr['v2'].sum() == 7
To make this work for your ResObj objects, sort them by the loc attribute, then pass each object's res attribute to the value_to_record function.
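That adaptation might look like the following sketch. The ResObj class and the key names are taken from the question; the converter and the sample values are hypothetical illustrations:

```python
import numpy as np

class ResObj:
    def __init__(self, index, result):
        self.loc = index
        self.res = result

# Record layout mirroring the question's result dictionaries.
dtype = [
    ("value1", "f8"),
    ("value2", "f8"),
    ("valuen", [("sub_value1", "f8"),
                ("sub_value2", "f8"),
                ("sub_value3", "f8")]),
]

def res_to_record(res):
    """Convert one .res dictionary to a tuple matching the dtype."""
    return (
        res["value1"],
        res["value2"],
        (res["valuen"]["sub_value1"],
         res["valuen"]["sub_value2"],
         res["valuen"]["sub_value3"]),
    )

# Two toy objects, deliberately out of order.
objs = [
    ResObj(1, {"value1": 6.0, "value2": 7.0,
               "valuen": {"sub_value1": 8.0, "sub_value2": 9.0, "sub_value3": 10.0}}),
    ResObj(0, {"value1": 1.0, "value2": 2.0,
               "valuen": {"sub_value1": 3.0, "sub_value2": 4.0, "sub_value3": 5.0}}),
]
objs.sort(key=lambda o: o.loc)  # order records by their target location

arr = np.fromiter((res_to_record(o.res) for o in objs),
                  dtype=dtype, count=len(objs))
print(arr["value1"])                # [1. 6.]
print(arr["valuen"]["sub_value3"])  # [ 5. 10.]
```

Sorting once up front replaces the per-object `final_result[var][i] = ...` indexing entirely, since each record lands at its `loc` position.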
Answer 1 (score: 1)
You can distribute the work between the processes by key name.
Here I create a pool of workers and pass them var and the optional sub-variable name.
The huge dataset is shared with the workers via cheap fork.
Unpacker.unpack picks the specified variable from each ResObj and returns it as an np.array.
The main loop in make_final_result combines the arrays into final_result.
Py2:
from collections import defaultdict
from multiprocessing import Process, Pool
import numpy as np

class ResObj(object):
    def __init__(self, index=None, result=None):
        self.loc = index    # The location where the values should go in the final result dictionary
        self.res = result   # A dictionary that holds the values for this location
        self.loc = 2
        self.res = {'value1': 5.4, 'value2': 2.3,
                    'valuen': {'sub_value1': 4.5, 'sub_value2': 3.4, 'sub_value3': 7.6}}

class Unpacker(object):
    @classmethod
    def cls_init(cls, list_of_results):
        cls.list_of_results = list_of_results

    @classmethod
    def unpack(cls, var, name):
        list_of_results = cls.list_of_results
        result = np.zeros(len(list_of_results))
        if name is None:
            for i, it in enumerate(list_of_results):
                result[i] = it.res[var]
        else:
            for i, it in enumerate(list_of_results):
                result[i] = it.res[var][name]
        return var, name, result

# Pool.map doesn't accept instancemethods, hence the use of a wrapper
def Unpacker_unpack((var, name),):
    return Unpacker.unpack(var, name)

def make_final_result(list_of_results):
    no_sub_result_variables = ['value1', 'value2']
    sub_result_variables = ['valuen']
    sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']
    pool = Pool(initializer=Unpacker.cls_init, initargs=(list_of_results, ))
    final_result = defaultdict(dict)

    def key_generator():
        for var in no_sub_result_variables:
            yield var, None
        for var in sub_result_variables:
            for name in sub_value_variables:
                yield var, name

    for var, name, result in pool.imap(Unpacker_unpack, key_generator()):
        if name is None:
            final_result[var] = result
        else:
            final_result[var][name] = result
    return final_result

if __name__ == '__main__':
    print make_final_result([ResObj() for x in xrange(10)])
Make sure that you are not on Windows. Windows lacks fork, so multiprocessing would have to transfer the entire dataset to each of the 24 worker processes.
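Note that the wrapper above is Python 2 only: tuple parameter unpacking in `def` signatures was removed in Python 3 (PEP 3113). A minimal sketch of the Python 3 equivalent, where the `Unpacker` class here is just a stand-in stub for the one in the answer above:

```python
class Unpacker:
    # Stand-in stub for the Unpacker class defined in the answer above;
    # the real unpack returns (var, name, np.array), this one returns None
    # in place of the array just to keep the sketch self-contained.
    @classmethod
    def unpack(cls, var, name):
        return var, name, None

def Unpacker_unpack(args):
    # Python 3: unpack the (var, name) tuple inside the function body.
    var, name = args
    return Unpacker.unpack(var, name)

print(Unpacker_unpack(('value1', None)))  # ('value1', None, None)
```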
Hope this helps.
Answer 2 (score: 0)
Remove a level of indentation so that the loops are not nested:
for obj in list_of_results:
    i = obj.loc
    result = obj.res
    for var in no_sub_result_variables:
        final_result[var][i] = result[var]
    for var in sub_result_variables:
        for name in sub_value_variables:
            try:
                final_result[var][name][i] = result[var][name]
            except KeyError:
                # TODO: add a debug check
                pass
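For reference, here is a self-contained toy run of that loop. The variable names come from the question; the two sample ResObj instances and their values are made up for illustration:

```python
import numpy as np

class ResObj:
    def __init__(self, index, result):
        self.loc = index
        self.res = result

no_sub_result_variables = ['value1', 'value2']
sub_result_variables = ['valuen']
sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']

# Two hypothetical result objects.
list_of_results = [
    ResObj(0, {'value1': 1.0, 'value2': 2.0,
               'valuen': {'sub_value1': 3.0, 'sub_value2': 4.0, 'sub_value3': 5.0}}),
    ResObj(1, {'value1': 6.0, 'value2': 7.0,
               'valuen': {'sub_value1': 8.0, 'sub_value2': 9.0, 'sub_value3': 10.0}}),
]

# Preallocate the output arrays as in the question.
num_of_results = len(list_of_results)
final_result = {var: np.zeros(num_of_results) for var in no_sub_result_variables}
for var in sub_result_variables:
    final_result[var] = {name: np.zeros(num_of_results) for name in sub_value_variables}

# The loop from the answer: the two inner for-loops are siblings,
# both one level inside the loop over the objects.
for obj in list_of_results:
    i = obj.loc
    result = obj.res
    for var in no_sub_result_variables:
        final_result[var][i] = result[var]
    for var in sub_result_variables:
        for name in sub_value_variables:
            try:
                final_result[var][name][i] = result[var][name]
            except KeyError:
                pass

print(final_result['value1'])                # [1. 6.]
print(final_result['valuen']['sub_value3'])  # [ 5. 10.]
```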