In the past I have dealt with averaging two paired lists, and I successfully used the answers provided there.
However, for large lists (over 20,000 items) the procedure is somewhat slow, and I wondered whether using NumPy would make it faster.
I start from two lists, one of floats and one of strings:
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
What I am trying to do is average the values that share the same name, so that after the process I would get:
result_names = ["a", "b", "c", "d", "e"]
result_values = [1.2, 4.4, 2.0, 5.67, 8.54]
I show two lists as an example of the result, but a single list of (name, value)
tuples would also be fine:
result = [("a", 1.2), ("b", 4.4), ("c", 2.0), ("d", 5.67), ("e", 8.54)]
What is the best way to do this with NumPy?
Answer 0 (score: 4)
With numpy you can write something yourself, or you can use existing groupby functionality (the rec_groupby function from matplotlib.mlab, which is however much slower; for more powerful groupby functionality, look at pandas). Below I compare it with the dictionary approach from Michael Dunn's answer:
import numpy as np
import random
from matplotlib.mlab import rec_groupby

listA = [random.choice("abcdef") for i in range(20000)]
listB = [20 * random.random() for i in range(20000)]

names = np.array(listA)
values = np.array(listB)

def f_dict(listA, listB):
    # collect the values for each name, then average each list
    d = {}
    for a, b in zip(listA, listB):
        d.setdefault(a, []).append(b)
    avg = []
    for key in d:
        avg.append(sum(d[key]) / len(d[key]))
    return list(d.keys()), avg

def f_numpy(names, values):
    # one boolean-mask selection over the values per unique name
    result_names = np.unique(names)
    result_values = np.empty(result_names.shape)
    for i, name in enumerate(result_names):
        result_values[i] = np.mean(values[names == name])
    return result_names, result_values
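The rec_groupby calls below operate on a record array whose construction was not shown in the original answer; a plausible way to build it (my assumption, based on the field names used below) is:

# hypothetical construction of struct_array; not shown in the original answer
struct_array = np.rec.fromarrays([names, values], names='names,values')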
Here are the three results:
In [2]: f_dict(listA, listB)
Out[2]:
(['a', 'c', 'b', 'e', 'd', 'f'],
[9.9003182717213765,
10.077784850173568,
9.8623915728699636,
9.9790599744319319,
9.8811096512807097,
10.118695410115953])
In [3]: f_numpy(names, values)
Out[3]:
(array(['a', 'b', 'c', 'd', 'e', 'f'],
dtype='|S1'),
array([ 9.90031827, 9.86239157, 10.07778485, 9.88110965,
9.97905997, 10.11869541]))
In [7]: rec_groupby(struct_array, ('names',), (('values', np.mean, 'resvalues'),))
Out[7]:
rec.array([('a', 9.900318271721376), ('b', 9.862391572869964),
('c', 10.077784850173568), ('d', 9.88110965128071),
('e', 9.979059974431932), ('f', 10.118695410115953)],
dtype=[('names', '|S1'), ('resvalues', '<f8')])
It looks like numpy is a bit faster for this test (and the predefined groupby function is much slower):
In [32]: %timeit f_dict(listA, listB)
10 loops, best of 3: 23 ms per loop
In [33]: %timeit f_numpy(names, values)
100 loops, best of 3: 9.78 ms per loop
In [8]: %timeit rec_groupby(struct_array, ('names',), (('values', np.mean, 'values'),))
1 loops, best of 3: 203 ms per loop
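For reference, the pandas groupby mentioned above would look roughly like this (a sketch, assuming pandas is installed; it is not part of the timings above):

import pandas as pd

# group the values by name and take the per-name mean
df = pd.DataFrame({"names": listA, "values": listB})
print(df.groupby("names")["values"].mean())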
Answer 1 (score: 3)
Maybe a numpy solution is more elaborate than you need. Without doing anything fancy, I found the following to be fast enough that there was no noticeable wait with 20,000 items in the lists:
import random

listA = [random.choice("abcdef") for i in range(20000)]
listB = [20 * random.random() for i in range(20000)]

# collect the values for each name, then print the per-name average
d = {}
for a, b in zip(listA, listB):
    d.setdefault(a, []).append(b)

for key in d:
    print(key, sum(d[key]) / len(d[key]))
Your mileage may vary, depending on whether 20,000 is a typical length for your lists, and on whether you do this only a few times in a script or hundreds or thousands of times.
Answer 2 (score: 0)
Somewhat late to the party, but since numpy still seems to lack this functionality, here is my best attempt at a pure-numpy solution for grouping by key. It should be considerably faster than the other proposed solutions for problem sets of appreciable size. The crux here is the excellent reduceat functionality.
import numpy as np

def group(key, value):
    """
    Group the values by key.
    Returns the unique keys, their corresponding per-key sum, and the key counts.
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # sum over the slices specified by index
    per_key_sum = np.add.reduceat(value, slices)
    # number of counts per key is the difference of our slice points,
    # capped off with the total number of keys for the last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count
names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_keys, per_key_sum, key_count = group(names, values)
print(per_key_sum / key_count)
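To make the reduceat step concrete, here is a minimal illustration (my own example, not from the original answer) of what np.add.reduceat does with a list of slice start indices:

import numpy as np

arr = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# sums arr[0:3], arr[3:5] and arr[5:] given the start indices [0, 3, 5]
print(np.add.reduceat(arr, [0, 3, 5]))  # [6. 9. 6.]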
Answer 3 (score: 0)
A simple solution via numpy, assuming vA0 and vB0 are numpy arrays and the grouping is done by vA0:
import numpy as np

def avg_group(vA0, vB0):
    # get the unique values in vA0, the index of their first occurrence, and their counts
    vA, ind, counts = np.unique(vA0, return_index=True, return_counts=True)
    vB = vB0[ind]
    # for keys that occur more than once, replace the first-occurrence value with
    # the average (one may change this as wished) of all elements sharing that key
    for dup in vA[counts > 1]:
        vB[np.where(vA == dup)] = np.average(vB0[np.where(vA0 == dup)])
    return vA, vB
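A quick usage check with the question's data (my own example):

names = np.array(["a", "b", "b", "c", "d", "e", "e"])
values = np.array([1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01])
vA, vB = avg_group(names, values)
print(vA)  # ['a' 'b' 'c' 'd' 'e']
print(vB)  # [1.2   4.4   2.    5.67  8.545]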