Question

I need to count the number of distinct columns in relatively large arrays.

import numpy as np

def nodistinctcols(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

X = np.array([np.random.randint(2, size = 16) for i in range(2**16)])

print("nodistinctcols(X.T)", nodistinctcols(X.T))

The last line takes 20 seconds on my computer, which seems excessively slow. By contrast, the line that builds X takes only 216 ms. Can nodistinctcols be sped up?

Answer 1

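You can use view to change the dtype of M so that an entire row (or column) is viewed as an array of bytes. Then np.unique can be applied to find the unique values: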

import numpy as np

def asvoid(arr):
    """
    View the array as dtype np.void (bytes).

    This views the last axis of ND-arrays as np.void (bytes) so 
    comparisons can be performed on the entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)

    Some caveats:
        - `asvoid` will work for integer dtypes, but be careful if using asvoid on float
        dtypes, since float zeros may compare UNEQUALLY:
        >>> asvoid([-0.]) == asvoid([0.])
        array([False], dtype=bool)

        - `asvoid` works best on contiguous arrays. If the input is not contiguous,
        `asvoid` will copy the array to make it contiguous, which will slow down the
        performance.

    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)

X = np.array([np.random.randint(2, size = 16) for i in range(2**16)])

print("nodistinctcols(X.T) {}".format(nodistinctcols(X.T)))
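To make the view trick a bit more concrete, here is a small sketch of what the void view does to a tiny array. The array `a` and the explicit `int64` dtype are my own choices for illustration, not part of the benchmark above:

```python
import numpy as np

# Toy input: rows 0 and 2 are identical.
a = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 0, 1]], dtype=np.int64)

# Same expression as in asvoid: each row of three int64 values
# (3 * 8 = 24 bytes) becomes one opaque 24-byte element.
v = np.ascontiguousarray(a).view(
    np.dtype((np.void, a.dtype.itemsize * a.shape[-1])))

print(v.shape)            # the last axis collapses to length 1
print(v.dtype.itemsize)   # bytes per element

# Whole rows now compare as single units.
same_0_2 = bool((v[0] == v[2]).all())
same_0_1 = bool((v[0] == v[1]).all())
print(same_0_2, same_0_1)
```

Because each row is now one element, anything that works on 1-D data (such as np.unique) effectively operates on whole rows.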

Benchmark:

In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop

In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop

where nodistinctcols_orig is defined by:

def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

The sanity check passes:

In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T)

By the way, it might make more sense to define

def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))

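and just pass M.T to the function when you wish to count the number of distinct columns. That way, the function is not slowed down by an unnecessary transpose if you want to use it to count the number of distinct rows.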
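On newer NumPy releases (1.13 and later, as far as I know), np.unique accepts an axis argument, so the same count can probably be obtained without the void view. A minimal sketch under that assumption; the helper name `num_distinct_rows_axis` and the small seeded test array are made up for illustration, and I have not benchmarked it against the void approach:

```python
import numpy as np

def num_distinct_rows_axis(M):
    # Hypothetical helper: relies on np.unique(..., axis=0),
    # which treats each row as one item (assumes NumPy >= 1.13).
    return len(np.unique(M, axis=0))

# Small seeded example so the result is reproducible.
rng = np.random.RandomState(0)
X_small = rng.randint(2, size=(2**10, 4))

n_rows = num_distinct_rows_axis(X_small)      # distinct rows of X_small
n_cols = num_distinct_rows_axis(X_small.T)    # distinct columns of X_small
print(n_rows, n_cols)
```

As with the void version, pass M.T when you want distinct columns rather than distinct rows.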

Answer 2

If you have fewer rows than columns, you could also do multiple stable sorts along the rows and count the uniques:

def count(x):
    x = x.copy()
    x = x[x[:,0].argsort()] # first sort can be unstable
    for i in range(1, x.shape[1]):
        x = x[x[:,i].argsort(kind='mergesort')] # stable sorts now
    # x is now sorted so that equal rows are next to each other
    # -> compare each row with its neighbour and count the all-equal pairs
    return x.shape[0] - np.count_nonzero((x[1:, :] == x[:-1,:]).all(axis=1))

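With numpy 1.9.dev this is faster than the void comparison; with older numpy versions the fancy indexing kills the performance (about 4 times slower than void).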

X = np.array([np.random.randint(2, size = 16) for i in range(2**16)])
In [6]: %timeit count(X)
10 loops, best of 3: 144 ms per loop
Xt = X.T.copy()
In [8]: %timeit unutbu_void(Xt)
10 loops, best of 3: 161 ms per loop
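The chain of stable sorts above orders the rows lexicographically, with the last column as the most significant key. As far as I can tell, np.lexsort produces the same kind of ordering in a single call, so a variant could look like the sketch below. The name `count_lexsort` is hypothetical, and I have not timed it against the versions above:

```python
import numpy as np

def count_lexsort(x):
    # np.lexsort treats the LAST key as the primary one, so passing x.T
    # sorts the rows of x with the last column most significant.
    order = np.lexsort(x.T)
    xs = x[order]
    # Equal rows are now adjacent: count neighbouring pairs that match
    # in every column, and subtract them from the total row count.
    dup = (xs[1:] == xs[:-1]).all(axis=1)
    return x.shape[0] - np.count_nonzero(dup)

demo = np.array([[1, 0],
                 [0, 1],
                 [1, 0],
                 [1, 1]])
n_distinct = count_lexsort(demo)
print(n_distinct)
```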

Answer 3

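Just for future reference, don't overlook old-fashioned approaches like using a set. Will it be as fast and memory-efficient as a clever numpy approach? No, but it is often good enough for now, which is nothing to sneeze at when you're on the clock.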

In [25]: %time slow = nodistinctcols(X.T)
CPU times: user 28.2 s, sys: 12 ms, total: 28.2 s
Wall time: 28.2 s

In [26]: %time medium = len(set(map(tuple, X)))
CPU times: user 324 ms, sys: 0 ns, total: 324 ms
Wall time: 322 ms

In [27]: slow == medium
Out[27]: True

The slow part is not the set, but the string conversion.
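To illustrate the point, here is a rough sketch of two set-based counters that avoid repr. The tuple version is the one timed above. The `tobytes` version is my own variation (not from the timings) and assumes an integer dtype, where equal values always have equal bytes:

```python
import numpy as np

def count_rows_tuple(X):
    # Each row becomes a hashable tuple of scalars.
    return len(set(map(tuple, X)))

def count_rows_bytes(X):
    # Each row becomes its raw bytes; a cheap hashable key.
    # Assumption: integer dtype, so equal rows give equal bytes.
    return len({row.tobytes() for row in X})

rng = np.random.RandomState(0)
X_demo = rng.randint(2, size=(2**12, 16))

n_tuple = count_rows_tuple(X_demo)
n_bytes = count_rows_bytes(X_demo)
print(n_tuple, n_bytes)
```

Both keep the simple set logic and only swap out how a row is turned into a hashable key.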

Speeding up counting the number of distinct columns
