Question

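I have many large (> 35,000,000) lists of integers that contain duplicates. I need to get a count for each integer in a list. The following code works, but it seems slow. Can anyone beat this benchmark using Python, preferably with Numpy?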

def group():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print(t.timeit(number=1))

Returns:

$ python bench.py 
111.377498865

Cheers!

Edit based on responses:

def group_original():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,len(list(g))) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_gnibbler():
    import numpy as np
    from itertools import groupby
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    groups = ((k,sum(1 for i in g)) for k,g in groupby(values))
    index = np.fromiter(groups,dtype='u4,u2')

def group_christophe():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    counts=values.searchsorted(values, side='right') - values.searchsorted(values, side='left')
    index = np.zeros(len(values),dtype='u4,u2')
    index['f0']=values
    index['f1']=counts
    #Erroneous result!

def group_paul():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    diff = np.concatenate(([1],np.diff(values)))
    idx = np.concatenate((np.where(diff)[0],[len(values)]))
    index = np.empty(len(idx)-1,dtype='u4,u2')
    index['f0']=values[idx[:-1]]
    index['f1']=np.diff(idx)

if __name__=='__main__':
    from timeit import Timer
    timings=[
                ("group_original","Original"),
                ("group_gnibbler","Gnibbler"),
                ("group_christophe","Christophe"),
                ("group_paul","Paul"),
            ]
    for method,title in timings:
        t = Timer("%s()"%method,"from __main__ import %s"%method)
        print("%s: %s secs" % (title, t.timeit(number=1)))

Returns:

$ python bench.py 
Original: 113.385262966 secs
Gnibbler: 71.7464978695 secs
Christophe: 27.1690568924 secs
Paul: 9.06268405914 secs

Note, however, that Christophe's version currently gives incorrect results.
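
To see why a per-element result is not the grouped index the question asks for, here is a minimal sketch on a six-element array. It assumes the intended output is one (value, count) row per distinct value; the names per_element, first, uniq and counts are illustrative and do not come from the original posts.

```python
import numpy as np
from collections import Counter

values = np.array([5, 3, 5, 1, 3, 5], dtype='u4')
values.sort()  # [1 3 3 5 5 5]

# Per-element counts, computed the way group_christophe does:
# one entry per input element, so duplicates appear repeatedly.
per_element = (values.searchsorted(values, side='right')
               - values.searchsorted(values, side='left'))

# Keeping only the first element of each run gives one row per distinct value.
first = np.concatenate(([True], values[1:] != values[:-1]))
uniq, counts = values[first], per_element[first]

# Independent reference result for checking any implementation on small inputs.
expected = Counter(values.tolist())
```

Comparing dict(zip(uniq.tolist(), counts.tolist())) with expected is a cheap way to validate a candidate before timing it on the full 35,000,000 elements.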

Answer 1

I get about a 3x improvement with this:

def group():
    import numpy as np
    values = np.array(np.random.randint(0,3298,size=35000000),dtype='u4')
    values.sort()
    dif = np.ones(values.shape,values.dtype)
    dif[1:] = np.diff(values)
    idx = np.where(dif>0)[0]
    vals = values[idx]
    # append len(values) so the last group's count is included
    count = np.diff(np.concatenate((idx,[len(values)])))

Answer 2

More than 5 years have passed since Paul's answer was accepted. Interestingly, sort() is still the bottleneck in the accepted solution.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           @profile
     4                                           def group_paul():
     5         1        99040  99040.0      2.4      import numpy as np
     6         1       305651 305651.0      7.4      values = np.array(np.random.randint(0, 2**32,size=35000000),dtype='u4')
     7         1      2928204 2928204.0    71.3      values.sort()
     8         1        78268  78268.0      1.9      diff = np.concatenate(([1],np.diff(values)))
     9         1       215774 215774.0      5.3      idx = np.concatenate((np.where(diff)[0],[len(values)]))
    10         1           95     95.0      0.0      index = np.empty(len(idx)-1,dtype='u4,u2')
    11         1       386673 386673.0      9.4      index['f0'] = values[idx[:-1]]
    12         1        91492  91492.0      2.2      index['f1'] = np.diff(idx)

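The accepted solution runs in 4.0 seconds on my machine; with radix sort it drops to 1.7 seconds.

Just by switching to radix sort, I get an overall 2.35x speedup. In this case the radix sort is more than 4x faster than quicksort.

See How to sort an array of integers faster than quicksort?, which was motivated by your question.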
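
For readers unfamiliar with the technique, here is a minimal pure-Python sketch of a least-significant-digit radix sort for 32-bit unsigned integers. It is only an illustration of the idea, not the implementation timed above, and radix_sort_u32 is a hypothetical name; a version fast enough to beat Numpy's sort would need to be written in compiled code.

```python
def radix_sort_u32(values):
    """LSD radix sort for integers in [0, 2**32), one byte per pass."""
    out = list(values)
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for v in out:
            # Distribute by the current byte, preserving arrival order.
            buckets[(v >> shift) & 0xFF].append(v)
        # Concatenating the buckets in order keeps each pass stable,
        # which is what makes the byte-by-byte passes compose correctly.
        out = [v for bucket in buckets for v in bucket]
    return out
```

Each of the four passes is O(N), so the total work is linear in the number of elements rather than O(N log N).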

Answer 3

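As requested, here is a Cython version. I make two passes through the array. The first pass counts the unique elements, so that the arrays for the unique values and their counts can be allocated at the right size.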

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def dogroup():
    cdef unsigned long tot = 1
    cdef np.ndarray[np.uint32_t, ndim=1] values = np.array(np.random.randint(35000000,size=35000000),dtype=np.uint32)
    cdef unsigned long i, ind, lastval
    values.sort()
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            tot += 1
    cdef np.ndarray[np.uint32_t, ndim=1] vals = np.empty(tot,dtype=np.uint32)
    cdef np.ndarray[np.uint32_t, ndim=1] count = np.empty(tot,dtype=np.uint32)
    vals[0] = values[0]
    ind = 1
    lastval = 0
    for i in xrange(1,len(values)):
        if values[i] != values[i-1]:
            vals[ind] = values[i]
            count[ind-1] = i - lastval
            lastval = i
            ind += 1
    count[ind-1] = len(values) - lastval

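Sorting takes by far the most time here. Using the values array given in my code, the sort takes 4.75 seconds and actually finding the unique values and counts takes 0.67 seconds. With the pure Numpy code from Paul's answer (but with the same form of the values array) and the fix I suggested in a comment, finding the unique values and counts takes 1.9 seconds (the sort still takes the same amount of time, of course).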

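It makes sense for most of the time to go to sorting, because sorting is O(N log N) and counting is O(N). You can speed up the sort a little compared with Numpy's (which uses C's qsort, if I remember correctly), but you have to really know what you are doing and it probably isn't worthwhile. There may also be ways to speed up my Cython code a little more, but that probably isn't worthwhile either.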
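
The split between sorting and counting can be measured separately with a small harness. This is a sketch in pure Numpy rather than Cython, on a much smaller array; time_phases and its parameters are hypothetical names, and the timings it returns will differ from the figures quoted above.

```python
import time
import numpy as np

def time_phases(n=200_000, seed=0):
    """Return (sort_seconds, count_seconds, uniq, counts) for n random u4 values."""
    rng = np.random.default_rng(seed)
    values = rng.integers(0, n, size=n, dtype=np.uint32)

    t0 = time.perf_counter()
    values.sort()                      # O(N log N) phase
    t1 = time.perf_counter()

    # O(N) phase: a run starts wherever neighbouring sorted values differ.
    starts = np.concatenate(([0], (values[1:] != values[:-1]).nonzero()[0] + 1))
    uniq = values[starts]
    counts = np.diff(np.concatenate((starts, [n])))
    t2 = time.perf_counter()

    return t1 - t0, t2 - t1, uniq, counts
```

Calling sort_s, count_s, uniq, counts = time_phases(35_000_000) reproduces the scale of the question, memory permitting.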

Answer 4

Here is a Numpy solution:

def group():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')

    # we sort in place
    values.sort()

    # when sorted the number of occurrences for a unique element is the index of 
    # the first occurrence when searching from the right - the index of the first
    # occurrence when searching from the left.
    #
    # np.dstack() is the numpy equivalent to Python's zip()

    l = np.dstack((values, values.searchsorted(values, side='right') - \
                   values.searchsorted(values, side='left')))

    index = np.fromiter(l, dtype='u4,u2')

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print(t.timeit(number=1))

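It runs in about 25 seconds on my machine, compared with about 96 for the initial solution (which is a nice improvement).

There may still be room for improvement; I don't use numpy that often.

Edit: added some comments to the code.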

Answer 5

This is a fairly old thread, but I thought I'd mention that there's a small improvement to be made on the currently accepted solution:

def group_by_edge():
    import numpy as np
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    values.sort()
    # a True at position j means values[j+1] starts a new run, hence the + 1
    edges = (values[1:] != values[:-1]).nonzero()[0] + 1
    idx = np.concatenate(([0], edges, [len(values)]))
    index = np.empty(len(idx) - 1, dtype= 'u4, u2')
    index['f0'] = values[idx[:-1]]
    index['f1'] = np.diff(idx)

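This tested about half a second faster on my machine; not a huge improvement, but worth something. Additionally, I think it's clearer what is happening here; the two-step diff approach is a bit opaque at first glance.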
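
A worked example on a tiny, already sorted array makes the indexing concrete. This is a sketch of the same edge-detection idea; starts is an illustrative name, and the run boundaries are taken as the positions j + 1 where values[j + 1] differs from values[j].

```python
import numpy as np

values = np.array([1, 1, 2, 2, 2, 7], dtype='u4')  # already sorted

# Position j is True when values[j + 1] differs from values[j],
# so the new run starts at index j + 1.
starts = (values[1:] != values[:-1]).nonzero()[0] + 1
idx = np.concatenate(([0], starts, [len(values)]))  # [0 2 5 6]

uniq = values[idx[:-1]]   # [1 2 7]
counts = np.diff(idx)     # [2 3 1]
```

The differences between consecutive run starts, with len(values) appended as a sentinel, are exactly the run lengths.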

Answer 6

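I guess the most obvious approach, and one not yet mentioned, is to simply use collections.Counter. Instead of building a huge number of temporary lists with groupby, it just increments integer counts. It's a one-liner and gives a 2x speedup, but it is still slower than the pure numpy solutions.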

def group():
    import numpy as np
    from collections import Counter
    # 1<<32 covers the full u4 range (sys.maxint exists only in Python 2)
    values = np.array(np.random.randint(0,1<<32,size=35000000),dtype='u4')
    c = Counter(values)

if __name__=='__main__':
    from timeit import Timer
    t = Timer("group()","from __main__ import group")
    print(t.timeit(number=1))

On my machine this is a speedup from 136 s to 62 s compared with the initial solution.

Answer 7

Replacing len(list(g)) with sum(1 for i in g) gives a 2x speedup.

Answer 8

Recent versions of numpy (1.9 and later) provide this directly:

import numpy as np
frequency = np.unique(values, return_counts=True)
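
As a usage sketch, the two arrays returned by np.unique can be packed into the same kind of structured index the question builds. The small input array is made up for illustration, and the count field is widened to u4 here on the assumption that some value may repeat more than 65535 times, which a u2 field cannot hold.

```python
import numpy as np

values = np.array([4, 9, 4, 4, 2, 9], dtype='u4')

# np.unique sorts internally and returns the distinct values with their counts.
uniq, counts = np.unique(values, return_counts=True)

# One (value, count) record per distinct value.
index = np.empty(len(uniq), dtype='u4,u4')
index['f0'] = uniq
index['f1'] = counts
```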

Answer 9

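Sorting is Θ(N log N); I'd go for the amortized O(N) provided by Python's hash table implementation. Just use defaultdict(int) to keep counts of the integers and iterate over the array once: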

import collections

counts = collections.defaultdict(int)
for v in values:
    counts[v] += 1

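Theoretically this is faster, but unfortunately I have no way to check right now. Allocating the additional memory might actually make it slower than your solution, which works in place.

Edit: if you need to save memory, try radix sort, which is much faster on integers than quicksort (which I believe is what numpy uses).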
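
A self-contained version of the hash-table idea might look like the sketch below. The sample array is made up, and converting with tolist() first is my own assumption about what helps: it makes the loop iterate over plain Python ints instead of Numpy scalars.

```python
import collections
import numpy as np

values = np.array([7, 7, 1, 7, 1, 3], dtype='u4')

counts = collections.defaultdict(int)
for v in values.tolist():      # plain Python ints as dictionary keys
    counts[v] += 1

# Sort the (value, count) pairs only if an ordered result is needed.
items = sorted(counts.items())
```

Unlike the sort-based answers, this leaves values untouched but needs extra memory proportional to the number of distinct values.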

Answer 10

You could try the following (ab)use of scipy.sparse:

import numpy as np
from scipy import sparse
def sparse_bincount(values):
    M = sparse.csr_matrix((np.ones(len(values)), values.astype(int), [0, len(values)]))
    M.sum_duplicates()
    index = np.empty(len(M.indices),dtype='u4,u2')
    index['f0'] = M.indices
    index['f1']= M.data
    return index

This is slower than the winning answer, perhaps because scipy currently doesn't support unsigned integers as index types...

Numpy grouping using itertools.groupby performance
