I have two 1D numpy arrays of equal length, id and data, where id is a sequence of repeating, ordered integers that defines sub-windows on data. For example:

id   data
1    2
1    7
1    3
2    8
2    9
2    10
3    1
3    -10

I would like to aggregate data by grouping on id and taking either the max or the min. In SQL this would be a typical aggregation query like SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id. Is there a way to do this in a vectorized fashion that avoids a Python loop, or do I have to drop down to C?
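For reference, a minimal sketch of the setup (my construction from the table above, not part of the original question), so the snippets in the answers below can be run directly:

import numpy as np

# the example from the question as two equal-length 1D arrays
id = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])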
Answer 0 (score: 9)
I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery it is most likely going to be faster than anything you can do in a Python loop.
import numpy as np

def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

# max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order]  # this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
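As a quick check (a sketch that assumes the id and data arrays from the question), both functions return one value per group, in id order:

print(group_max(id, data))   # -> [ 7 10  1]
print(group_min(id, data))   # -> [  2   8 -10]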
Answer 1 (score: 6)
Pure Python:
from itertools import groupby, imap, izip
from operator import itemgetter as ig
print [max(imap(ig(1), g)) for k, g in groupby(izip(id, data), key=ig(0))]
# -> [7, 10, 1]
A variant:
print [data[id==i].max() for i, _ in groupby(id)]
# -> [7, 10, 1]
import numpy as np
# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]
# get max()
print data[np.r_[np.diff(id), True].astype(bool)]
# -> [ 7 10 1]
If you have pandas installed:
from pandas import DataFrame
df = DataFrame(dict(id=id, data=data))
print df.groupby('id')['data'].max()
# id
# 1 7
# 2 10
# 3 1
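Presumably the same groupby pattern gives the minimum, or both at once via agg (building on the df defined above; the agg call is my addition, not part of the original answer):

print df.groupby('id')['data'].min()                 # per-group minimum
print df.groupby('id')['data'].agg(['min', 'max'])   # both in one call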
Answer 2 (score: 3)
I'm new to Python and Numpy, but it seems you can use the .at method of a ufunc instead of reduceat. For example:
import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
ans = np.empty(data_id[-1]+1) # might want to use max(data_id) and zeros instead
np.maximum.at(ans,data_id,data_val)
data_val = array([ 0.65753453,  0.84279716,  0.88189818,  0.18987882,  0.49800668,
                   0.29656994,  0.39542769,  0.43155428,  0.77982853,  0.44955868,
                   0.22080219,  0.4807312 ,  0.9288989 ,  0.10956681,  0.73215416,
                   0.33184318,  0.10936647])
ans = array([ 0.98969952,  0.84044947,  0.63460516,  0.92042078,  0.75738113,
              0.37976055])

Of course this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers and not huge; presumably if they are large/sparse you could initialize ans using np.unique(data_id) or something).

I should point out that data_id does not actually need to be sorted.
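One caveat worth adding (my note, not from the answer above): because ans is created with np.empty, it starts with whatever garbage happens to be in memory, and np.maximum.at will never overwrite a group whose leftover value already exceeds every incoming data value. A minimal sketch that avoids this by initializing with -inf:

import numpy as np

data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))

ans = np.full(data_id.max() + 1, -np.inf)  # every group starts below any possible value
np.maximum.at(ans, data_id, data_val)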
Answer 3 (score: 1)
I have packaged a version of my previous answer in the numpy_indexed package; it is nice to have this all wrapped up and tested in a neat interface, and it has a lot more functionality as well:

import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)

and so on.
Answer 4 (score: 0)
I think this accomplishes what you're looking for:
[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]
For the outer list comprehension, from right to left, set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes, in this case, the maximum of yet another list comprehension. Moving to the inner list comprehension: enumerate(data) returns the index and value from data, and if id[idx] == k picks out the members of data corresponding to id k.

This iterates over the full data list for every k. With some preprocessing into sublists it could probably be sped up (as sketched below), but then it won't be a one-liner.
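A sketch of that preprocessing (my addition, not part of the answer): group the values into lists in a single pass with a dict, then take the max per key:

from collections import defaultdict

groups = defaultdict(list)
for k, v in zip(id, data):                       # single pass over the data
    groups[k].append(v)
print([max(groups[k]) for k in sorted(groups)])  # -> [7, 10, 1]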
Answer 5 (score: 0)
The following solution only requires a sort on the data (not a lexsort) and does not require finding boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x sets r to the last value of x for each value of o, so that r[[0, 0]] = [1, 2] leaves r[0] = 2. It requires that your groups are integers from 0 to number of groups - 1, as for numpy.bincount, and that there is a value for every group:
def group_min(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
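A quick check on the question's data (a sketch assuming the id and data arrays from the question; id must be shifted to start at 0 for this approach):

groups = id - 1                  # map ids 1..3 onto 0..2, as required
print(group_max(groups, data))   # -> [ 7. 10.  1.]
print(group_min(groups, data))   # -> [  2.   8. -10.]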
Answer 6 (score: 0)
A slightly faster and more general answer than the already accepted one; like joeln's answer it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only requires the keys to be sortable, rather than being integers in a specific range. The accepted answer may still be faster, considering the max/min is not explicitly computed. Its ability to ignore nans is neat, but one can also simply assign nan values a dummy key.
import numpy as np

def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
    """
    # upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    # first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    # the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1] != key[1:])[0] + 1))
    # first entry of each bin is a unique key
    unique_keys = key[slices]
    # reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    # number of counts per key is the difference of our slice points; cap off with number of keys for last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count


names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

unique_keys, reduced_values, key_count = group(names, values)
print 'per group mean'
print reduced_values / key_count
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print 'per group min'
print reduced_values
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print 'per group max'
print reduced_values
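Applied to the question's arrays (a sketch assuming the id and data defined in the question), the same helper reproduces the grouped maximum directly:

unique_keys, grouped_max, key_count = group(id, data, np.maximum)
print unique_keys    # -> [1 2 3]
print grouped_max    # -> [ 7 10  1]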
Answer 7 (score: 0)
Numpy only, and no loops:
import numpy as np
import pandas as pd  # only needed for the comparison at the end

id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])
# max
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)
# min
_ndx = np.argsort(id)
_id, _pos = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)
# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])
(pd_group.values == np_group.values).all() # TRUE