Question

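I have a 100000000x2 array named "a", with an index in the first column and an associated value in the second column. For each index I need the median of the numbers in the second column. This is how I do it with a for statement: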

import numpy as np
b = np.zeros(1000000)
a = np.array([[1, 2],
              [1, 3],
              [2, 3],
              [2, 4],
              [2, 6],
              [1, 4],
              ...
              ...
              [1000000,6]])
for i in xrange(1000000):
    b[i]=np.median(a[np.where(a[:,0]==i),1])

Obviously this is far too slow with the for iteration. Any suggestions? Thanks.

Answer 1

This is called a "group by" operation. Pandas (http://pandas.pydata.org/) is a good tool for it:

import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [2.0, 5.0],
              [2.0, 6.0],
              [2.0, 8.0],
              [1.0, 4.0],
              [1.0, 1.0],
              [1.0, 3.5],
              [5.0, 8.0],
              [2.0, 1.0],
              [5.0, 9.0]])

# Create the pandas DataFrame.
df = pd.DataFrame(a, columns=['index', 'value'])

# Form the groups.
grouped = df.groupby('index')

# `result` is the DataFrame containing the aggregated results.
result = grouped.median()
print(result)

Output:

       value
index       
1        3.0
2        5.5
5        8.5

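There are several ways to create a DataFrame that holds the original data directly, so you don't necessarily have to build the numpy array a first.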
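As a minimal sketch of building the DataFrame directly, assuming the data is available as two plain Python lists (the sample values below are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical sample data, built without an intermediate NumPy array.
df = pd.DataFrame({
    "index": [1, 1, 2, 2, 2, 1],
    "value": [2.0, 3.0, 3.0, 4.0, 6.0, 4.0],
})

# Selecting the column first returns a Series of medians keyed by index.
medians = df.groupby("index")["value"].median()
print(medians)
```

Selecting the "value" column before aggregating gives a Series rather than a one-column DataFrame, which is often more convenient to feed back into NumPy code.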

More about groupby operations in Pandas: http://pandas.pydata.org/pandas-docs/dev/groupby.html

Answer 2

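It's a bit fiddly, but you can at least get rid of that annoying == comparison (which is probably your speed killer) by sorting. Trying to optimize further is probably not very useful, though it might be possible if you do the sorting yourself, etc.: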

# First sort the whole thing (there are probably other ways):
sorter = np.argsort(a[:,0]) # sort by class.
a = a[sorter] # sorted version of a

# Now we need to find where there are changes in the class:
w = np.where(a[:-1,0] != a[1:,0])[0] + 1 # Where the class changes.
# for simplicity, add 0 and len(a) so that every class has a full slice...
w = np.concatenate(([0], w, [len(a)]))
result = np.zeros(len(w)-1) # float, since a median may be fractional
for i in range(len(w)-1):
    result[i] = np.median(a[w[i]:w[i+1], 1]) # median of the value column only

# If the classes are not exactly 1, 2, ..., N we could add class information:
classes = a[w[:-1],0]

If all classes have the same size, i.e. there are exactly as many 1s as 2s and so on, there are better ways, though.
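One possible reading of the equal-size case, as a sketch only: if every class is known to have the same number of rows, the sorted value column can be reshaped to one row per class and reduced with a single np.median call. The data and the name group_size below are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data: three classes (1, 2, 3), each with exactly four rows.
a = np.array([[1, 2], [2, 9], [3, 5], [1, 4],
              [2, 7], [3, 1], [1, 8], [2, 3],
              [3, 6], [1, 6], [2, 5], [3, 2]])

group_size = 4  # assumed to be known and identical for every class

# Sorting by class makes each class a contiguous block of rows.
a_sorted = a[np.argsort(a[:, 0], kind="stable")]

# One row per class, then one vectorized median along each row.
values = a_sorted[:, 1].reshape(-1, group_size)
medians = np.median(values, axis=1)

# The class label belonging to each entry of `medians`.
classes = a_sorted[::group_size, 0]
```

This removes the Python-level loop entirely, but it only works when the equal-size assumption really holds; the reshape raises an error if the row count is not a multiple of group_size.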

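EDIT: See Bitwise's version for a solution that avoids the last for loop (he also hides some of the code in np.unique, which you might prefer, since speed doesn't matter for that part anyway).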

Answer 3

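Here is my version, with no for loops and no additional modules. The idea is to sort the array once; after that, the positions of the medians follow easily just from counting the indices in the first column of a: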

# sort by first column and then by second
b=a[np.lexsort((a[:,1],a[:,0]))]

# find central value for each index
c=np.unique(b[:,0],return_index=True)[1]
d=np.r_[c,len(a)]
inds=(d[1:]+d[:-1]-1)/2.0
# final result (as suggested by seberg)
medians=np.mean(np.c_[b[np.floor(inds).astype(int),1],
                      b[np.ceil(inds).astype(int),1]],1)

# inds is the index of the median value for each key

You can shorten the code if you like.
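A shortened, self-contained form of the same idea might look like the sketch below. The helper name group_medians and the random test data are assumptions added for illustration; the final check compares the result against a plain per-key np.median.

```python
import numpy as np

def group_medians(a):
    """Per-key medians of a[:, 1], keyed by a[:, 0], using a single lexsort."""
    # Sort by key first, then by value within each key.
    b = a[np.lexsort((a[:, 1], a[:, 0]))]
    # First row of each key in the sorted array.
    keys, starts = np.unique(b[:, 0], return_index=True)
    bounds = np.r_[starts, len(b)]
    # Middle position of each group; ends in .5 when the group size is even.
    mid = (bounds[1:] + bounds[:-1] - 1) / 2.0
    lo = np.floor(mid).astype(int)
    hi = np.ceil(mid).astype(int)
    # Average of the two middle values (the same value twice for odd sizes).
    return keys, (b[lo, 1] + b[hi, 1]) / 2.0

# Hypothetical random data: 1000 rows, keys in [0, 50), values in [0, 100).
rng = np.random.default_rng(0)
a = np.column_stack([rng.integers(0, 50, size=1000),
                     rng.integers(0, 100, size=1000)])

keys, medians = group_medians(a)
expected = np.array([np.median(a[a[:, 0] == k, 1]) for k in keys])
```

Here np.allclose(medians, expected) should hold, since for a sorted group occupying rows s to e-1 the middle position is (s + e - 1) / 2.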

Answer 4

If you find yourself wanting to do this kind of thing a lot, I'd suggest taking a look at the pandas library, which makes it easy:

>>> import pandas
>>> df = pandas.DataFrame([["A", 1], ["B", 2], ["A", 3], ["A", 4], ["B", 5]], columns=["One", "Two"])
>>> print(df)
  One  Two
0   A    1
1   B    2
2   A    3
3   A    4
4   B    5
>>> df.groupby('One').median()
      Two
One     
A    3.0
B    3.5

Answer 5

A quick one-liner approach:

result = [np.median(a[a[:,0]==ii,1]) for ii in np.unique(a[:,0])]

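I don't believe there is much you can do to speed this up without sacrificing accuracy. But here is another attempt, which might be faster if you can skip the sorting step: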

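# Assumes `a` has an integer dtype (np.bincount needs non-negative ints).
num_in_ind = np.bincount(a[:,0])
# Integer division; for an even count this picks the upper of the two middle values.
results = [np.sort(a[a[:,0]==ii,1])[num_in_ind[ii]//2] for ii in np.unique(a[:,0])]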

For small arrays the latter is a bit faster. Not sure whether it is fast enough.
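To make the accuracy trade-off concrete, here is a small sketch on hypothetical data comparing the two variants. It assumes an integer array and uses integer division for the middle position; the names exact and approx are chosen for illustration.

```python
import numpy as np

# Hypothetical data: key 1 has four values (even count), key 2 has three.
a = np.array([[1, 2], [1, 3], [2, 3], [2, 4], [2, 6], [1, 4], [1, 10]])

# Variant 1: true median per key.
exact = [np.median(a[a[:, 0] == ii, 1]) for ii in np.unique(a[:, 0])]

# Variant 2: element at position count // 2 of the sorted group.
num_in_ind = np.bincount(a[:, 0])
approx = [np.sort(a[a[:, 0] == ii, 1])[num_in_ind[ii] // 2]
          for ii in np.unique(a[:, 0])]
```

For key 1 the true median is 3.5 while the second variant returns 4, the upper of the two middle values; for key 2, with an odd count, both give 4.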

Processing an array: how to avoid using a "for" statement
