I have a list of objects that I need to unpack into a dictionary efficiently. There are more than 2,000,000 objects in the list, and the operation takes over 1.5 hours to complete. I would like to know whether this can be done more efficiently. The objects in the list are based on this class:
class ResObj:
    def __init__(self, index, result):
        self.loc = index    # The location where the values should go in the final result dictionary
        self.res = result   # A dictionary that holds the values for this location

# Example attribute values for one instance:
# self.loc = 2
# self.res = {'value1': 5.4, 'value2': 2.3,
#             'valuen': {'sub_value1': 4.5, 'sub_value2': 3.4, 'sub_value3': 7.6}}
Currently I do this with the following method:
def make_final_result(list_of_results):
    no_sub_result_variables = ['value1', 'value2']
    sub_result_variables = ['valuen']
    sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']
    final_result = {}
    num_of_results = len(list_of_results)
    for var in no_sub_result_variables:
        final_result[var] = numpy.zeros(num_of_results)
    for var in sub_result_variables:
        final_result[var] = {sub_var: numpy.zeros(num_of_results) for sub_var in sub_value_variables}

    for obj in list_of_results:
        i = obj.loc
        result = obj.res
        for var in no_sub_result_variables:
            final_result[var][i] = result[var]
        for var in sub_result_variables:
            for name in sub_value_variables:
                try:
                    final_result[var][name][i] = result[var][name]
                except KeyError:
                    # TODO: add a debug check
                    pass
I tried to parallelize this with multiprocessing.Manager().dict and Manager().Array(), but I could only ever get 2 processes working (even though I manually set the number of processes to the number of CPUs, 24). Can you help me find a faster way to do this? Thank you.
Answer 0 (score: 2)
Nested numpy arrays do not seem to be the best way to structure your data. You can use numpy's structured arrays to create a more intuitive data structure.
Generating the data the following way created 2,000,000-long arrays in 2 seconds on my machine:
import numpy as np
# example values
values = [
    {
        "v1": 0,
        "v2": 1,
        "vs": {
            "x": 2,
            "y": 3,
            "z": 4,
        }
    },
    {
        "v1": 5,
        "v2": 6,
        "vs": {
            "x": 7,
            "y": 8,
            "z": 9,
        }
    }
]

def value_to_record(value):
    """Take a dictionary and convert it to an array-like format"""
    return (
        value["v1"],
        value["v2"],
        (
            value["vs"]["x"],
            value["vs"]["y"],
            value["vs"]["z"]
        )
    )

# define what a record looks like -- f8 is an 8-byte float
dtype = [
    ("v1", "f8"),
    ("v2", "f8"),
    ("vs", [
        ("x", "f8"),
        ("y", "f8"),
        ("z", "f8")
    ])
]

# create the actual array
arr = np.fromiter(map(value_to_record, values), dtype=dtype, count=len(values))

# access an individual record
print(arr[0])  # prints (0.0, 1.0, (2.0, 3.0, 4.0))
# access a specific value
assert arr[0]['vs']['x'] == 2
# access all values of a specific field
print(arr['v2'])  # prints [ 1.  6.]
assert arr['v2'].sum() == 7
To make this work for your ResObj objects, sort them by the loc attribute, then pass each object's res attribute to the value_to_record function.
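That adaptation might look like the following sketch. The ResObj class and the key names are taken from the question; the converter and the sample values are hypothetical illustrations:

```python
import numpy as np

class ResObj:
    def __init__(self, index, result):
        self.loc = index
        self.res = result

# Record layout mirroring the question's result dictionaries.
dtype = [
    ("value1", "f8"),
    ("value2", "f8"),
    ("valuen", [("sub_value1", "f8"),
                ("sub_value2", "f8"),
                ("sub_value3", "f8")]),
]

def res_to_record(res):
    """Convert one .res dictionary to a tuple matching the dtype."""
    return (
        res["value1"],
        res["value2"],
        (res["valuen"]["sub_value1"],
         res["valuen"]["sub_value2"],
         res["valuen"]["sub_value3"]),
    )

# Two toy objects, deliberately out of order.
objs = [
    ResObj(1, {"value1": 6.0, "value2": 7.0,
               "valuen": {"sub_value1": 8.0, "sub_value2": 9.0, "sub_value3": 10.0}}),
    ResObj(0, {"value1": 1.0, "value2": 2.0,
               "valuen": {"sub_value1": 3.0, "sub_value2": 4.0, "sub_value3": 5.0}}),
]
objs.sort(key=lambda o: o.loc)  # order records by their target location

arr = np.fromiter((res_to_record(o.res) for o in objs),
                  dtype=dtype, count=len(objs))
print(arr["value1"])                # [1. 6.]
print(arr["valuen"]["sub_value3"])  # [ 5. 10.]
```

Sorting once up front replaces the per-object `final_result[var][i] = ...` indexing entirely, since each record lands at its `loc` position.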
Answer 1 (score: 1)
You can distribute the work between the processes by key name.
Here I create a pool of workers and pass them var and the optional sub-variable name.
The huge dataset is shared with the workers via cheap fork.
Unpacker.unpack picks the specified variable from each ResObj and returns it as an np.array.
The main loop in make_final_result combines the arrays into final_result.
Py2:
from collections import defaultdict
from multiprocessing import Process, Pool
import numpy as np

class ResObj(object):
    def __init__(self, index=None, result=None):
        self.loc = index    # The location where the values should go in the final result dictionary
        self.res = result   # A dictionary that holds the values for this location
        self.loc = 2
        self.res = {'value1': 5.4, 'value2': 2.3,
                    'valuen': {'sub_value1': 4.5, 'sub_value2': 3.4, 'sub_value3': 7.6}}

class Unpacker(object):
    @classmethod
    def cls_init(cls, list_of_results):
        cls.list_of_results = list_of_results

    @classmethod
    def unpack(cls, var, name):
        list_of_results = cls.list_of_results
        result = np.zeros(len(list_of_results))
        if name is None:
            for i, it in enumerate(list_of_results):
                result[i] = it.res[var]
        else:
            for i, it in enumerate(list_of_results):
                result[i] = it.res[var][name]
        return var, name, result

# Pool.map doesn't accept instancemethods, hence the use of a wrapper
def Unpacker_unpack((var, name),):
    return Unpacker.unpack(var, name)

def make_final_result(list_of_results):
    no_sub_result_variables = ['value1', 'value2']
    sub_result_variables = ['valuen']
    sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']
    pool = Pool(initializer=Unpacker.cls_init, initargs=(list_of_results, ))
    final_result = defaultdict(dict)

    def key_generator():
        for var in no_sub_result_variables:
            yield var, None
        for var in sub_result_variables:
            for name in sub_value_variables:
                yield var, name

    for var, name, result in pool.imap(Unpacker_unpack, key_generator()):
        if name is None:
            final_result[var] = result
        else:
            final_result[var][name] = result
    return final_result

if __name__ == '__main__':
    print make_final_result([ResObj() for x in xrange(10)])
Make sure that you are not on Windows. Windows lacks fork, so multiprocessing would have to transfer the entire dataset to each of the 24 worker processes.
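Note that the wrapper above is Python 2 only: tuple parameter unpacking in `def` signatures was removed in Python 3 (PEP 3113). A minimal sketch of the Python 3 equivalent, where the `Unpacker` class here is just a stand-in stub for the one in the answer above:

```python
class Unpacker:
    # Stand-in stub for the Unpacker class defined in the answer above;
    # the real unpack returns (var, name, np.array), this one returns None
    # in place of the array just to keep the sketch self-contained.
    @classmethod
    def unpack(cls, var, name):
        return var, name, None

def Unpacker_unpack(args):
    # Python 3: unpack the (var, name) tuple inside the function body.
    var, name = args
    return Unpacker.unpack(var, name)

print(Unpacker_unpack(('value1', None)))  # ('value1', None, None)
```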
Hope this helps.
Answer 2 (score: 0)
Remove a level of indentation so that the loops are not nested:
for obj in list_of_results:
    i = obj.loc
    result = obj.res
    for var in no_sub_result_variables:
        final_result[var][i] = result[var]
    for var in sub_result_variables:
        for name in sub_value_variables:
            try:
                final_result[var][name][i] = result[var][name]
            except KeyError:
                # TODO: add a debug check
                pass
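For reference, here is a self-contained toy run of that loop. The variable names come from the question; the two sample ResObj instances and their values are made up for illustration:

```python
import numpy as np

class ResObj:
    def __init__(self, index, result):
        self.loc = index
        self.res = result

no_sub_result_variables = ['value1', 'value2']
sub_result_variables = ['valuen']
sub_value_variables = ['sub_value1', 'sub_value2', 'sub_value3']

# Two hypothetical result objects.
list_of_results = [
    ResObj(0, {'value1': 1.0, 'value2': 2.0,
               'valuen': {'sub_value1': 3.0, 'sub_value2': 4.0, 'sub_value3': 5.0}}),
    ResObj(1, {'value1': 6.0, 'value2': 7.0,
               'valuen': {'sub_value1': 8.0, 'sub_value2': 9.0, 'sub_value3': 10.0}}),
]

# Preallocate the output arrays as in the question.
num_of_results = len(list_of_results)
final_result = {var: np.zeros(num_of_results) for var in no_sub_result_variables}
for var in sub_result_variables:
    final_result[var] = {name: np.zeros(num_of_results) for name in sub_value_variables}

# The loop from the answer: the two inner for-loops are siblings,
# both one level inside the loop over the objects.
for obj in list_of_results:
    i = obj.loc
    result = obj.res
    for var in no_sub_result_variables:
        final_result[var][i] = result[var]
    for var in sub_result_variables:
        for name in sub_value_variables:
            try:
                final_result[var][name][i] = result[var][name]
            except KeyError:
                pass

print(final_result['value1'])                # [1. 6.]
print(final_result['valuen']['sub_value3'])  # [ 5. 10.]
```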