Normalizing list data to the range (0.0, 1.0)

Time: 2013-05-23 07:48:27

Tags: python algorithm optimization

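I am building a dict whose values are lists of parameters. The parameters are mostly floats or ints, and sometimes booleans (stored as 0 or 1 in my case). Now I want to select the best entry of the dict (= the one with the highest parameter values).

To do that I need to normalize the parameters so that each one lies in the range 0 ... 1. A naive approach is to build a list of maxima, one per list "column", and then divide every value by the maximum of its column: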

import heapq

a = {1: [1.0, 23.7, 17.5, 0.2],
     2: [0.0, 87.3, 11.2, 0.5],
     3: [1.0, 17.4, 15.2, 0.7]}

ran = len(a.values()[0])

max = [0.0 for i in range(0,ran)]

for vals in a.values():
    max = [max[x] if max[x] > vals[x] else vals[x] for x in range(0,ran)]

a = {k : [v[x]/max[x] for x in range(0,ran)] for k,v in a.items()}

best = heapq.nlargest(1, (v for v in a.values()), key=lambda v: sum(v))

print a
print best

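This seems to work, but is there anything I can optimize here? The dicts I have to process will contain more than 1000 entries, and the number of parameters will be between 20 and 50. I also need to do this for roughly 1000 sets of dicts, so a fast approach would help a lot.

Edit: I have now tested it with generated data: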

import heapq
import random

def normalise(a):
    ran = len(a.values()[0])

    max = [0.0 for i in range(0,ran)]

    for vals in a.values():
        max = [max[x] if max[x] > vals[x] else vals[x] for x in range(0,ran)]

    a = {k : [v[x]/max[x] for x in range(0,ran)] for k,v in a.items()}

    # find best list
    best = heapq.nlargest(1, (v for v in a.values()), key=lambda v: sum(v))


# test this 1000 times 
for _ in xrange(1000):
    a = { k: [1000.0*random.random() for i in xrange(50)] for k in xrange(1000)} 
    normalise(a)

and got the following result:

25,84s user 0,02s system 49% cpu 52,189 total, running python normalise.py

2 answers:

Answer 0 (score: 4)

You want to loop over the dict directly and process each list in place:

from operator import itemgetter

best = (0, [])
maxes = [max(c) for c in zip(*a.values())]
for k, v in a.iteritems():
    v = a[k] = [c/m for c, m in zip(v, maxes)]
    best = max([best, (sum(v), v)], key=itemgetter(0))

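This uses zip(*iterable) to loop over the columns of a and collect the maximum of each one. We then normalize every row against the per-column maxima and pick the best row in the same pass.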
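To make the transposition step concrete, here is a minimal sketch that runs the same idea on the sample dict from the question. It is written for Python 3 (the thread itself uses Python 2), the names columns, normalised and best_key are illustrative only, and it assumes every column maximum is non-zero.

```python
# Sample data taken from the question.
a = {1: [1.0, 23.7, 17.5, 0.2],
     2: [0.0, 87.3, 11.2, 0.5],
     3: [1.0, 17.4, 15.2, 0.7]}

# zip(*rows) transposes the rows, so each tuple holds one column.
columns = list(zip(*a.values()))
maxes = [max(col) for col in columns]

# Divide each value by its column maximum
# (assumption: no column maximum is zero).
normalised = {k: [c / m for c, m in zip(v, maxes)] for k, v in a.items()}

# Pick the key whose normalised row has the largest sum.
best_key = max(normalised, key=lambda k: sum(normalised[k]))

print(maxes)     # [1.0, 87.3, 17.5, 0.7]
print(best_key)  # 3
```

Unlike the answer's loop, this sketch builds a new dict instead of overwriting a, which keeps the original values around for inspection.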

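Note that instead of heapq.nlargest(1, ...) this simply uses max, which is the more efficient way to get a single largest item.

Timings measured with the timeit module, compared against the original sample: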
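As a small illustration of that substitution, the sketch below (Python 3, with made-up row values) shows that both calls select the same row. The only difference visible here is that heapq.nlargest returns a one-element list while max returns the item itself.

```python
import heapq

# Illustrative, already-normalised rows (values are made up).
rows = [[1.0, 0.27, 1.0, 0.29],
        [0.0, 1.0, 0.64, 0.71],
        [1.0, 0.2, 0.87, 1.0]]

# nlargest(1, ...) returns a list holding the single best row.
top_via_heap = heapq.nlargest(1, rows, key=sum)[0]

# max returns the best row directly.
top_via_max = max(rows, key=sum)

print(top_via_heap == top_via_max)  # True
```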

>>> from timeit import timeit
>>> from operator import itemgetter
>>> import heapq
>>> def original(a):
...     ran = len(a.values()[0])
...     max = [0.0 for i in range(0,ran)]
...     for vals in a.values():
...         max = [max[x] if max[x] > vals[x] else vals[x] for x in range(0,ran)]
...     a = {k : [v[x]/max[x] for x in range(0,ran)] for k,v in a.items()}
...     best = heapq.nlargest(1, (v for v in a.values()), key=lambda v: sum(v))
... 
>>> def zip_and_max(a):
...     best = (0, [])
...     maxes = [max(c) for c in zip(*a.values())]
...     for k, v in a.iteritems():
...         v = a[k] = [c/m for c, m in zip(v, maxes)]
...         best = max([best, (sum(v), v)], key=itemgetter(0))
... 
>>> timeit('f(a.copy())', 'from __main__ import a, original as f', number=100000)
2.6306018829345703
>>> timeit('f(a.copy())', 'from __main__ import a, zip_and_max as f', number=100000)
1.6974060535430908

And with a random set:

>>> import random
>>> random_a = { k: [1000.0*random.random() for i in xrange(50)] for k in xrange(1000)}
>>> timeit('f(a.copy())', 'from __main__ import a, original as f', number=100000)
2.7121059894561768
>>> timeit('f(a.copy())', 'from __main__ import a, zip_and_max as f', number=100000)
1.745398998260498

With a fresh random set on every call (note the much lower repetition count):

>>> timeit('f(r())', 'from __main__ import random_dict as r, original as f', number=100)
4.437845945358276
>>> timeit('f(r())', 'from __main__ import random_dict as r, zip_and_max as f', number=100)
3.2406938076019287

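But it sounds like you are dealing with matrices here. You want to look at numpy, a far more efficient library for working with matrices like these.

Answer 1 (score: 1)

This is all you need: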
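The answer does not show any numpy code, so the following is only a rough sketch of how the same normalization might look with it. It assumes numpy is installed, that all rows have the same length, and that no column maximum is zero. The variable names are illustrative.

```python
import numpy as np

# Sample data taken from the question.
a = {1: [1.0, 23.7, 17.5, 0.2],
     2: [0.0, 87.3, 11.2, 0.5],
     3: [1.0, 17.4, 15.2, 0.7]}

# Keep the keys in a list so row indices can be mapped back to dict keys.
keys = list(a)
matrix = np.array([a[k] for k in keys], dtype=float)  # shape: (rows, params)

# Column-wise maxima; broadcasting divides each column by its own maximum
# (assumption: no column maximum is zero).
col_max = matrix.max(axis=0)
normalised = matrix / col_max

# Row with the largest sum of normalised parameters.
best_index = int(normalised.sum(axis=1).argmax())
best_key = keys[best_index]
best_row = normalised[best_index]

print(best_key)  # 3
```

Whether this is faster in practice depends on how expensive it is to convert the dicts to arrays, so it is worth timing against the pure-Python version on realistic data.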

key, best = max(a.iteritems(), key = lambda t: sum(t[1])/max(t[1]))