I need to count the number of distinct columns in relatively large arrays.
import numpy as np

def nodistinctcols(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
print "nodistinctcols(X.T)", nodistinctcols(X.T)
The last line takes 20 seconds on my computer, which seems excessively slow. By comparison, X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)]) takes 216 ms. Can nodistinctcols be sped up?
Answer 0: (score: 3)
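You can use view to change the dtype of M so that an entire row (or column) is viewed as an array of bytes. Then np.unique can be applied to find the unique values: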
import numpy as np

def asvoid(arr):
    """
    View the array as dtype np.void (bytes).

    This views the last axis of ND-arrays as np.void (bytes) so
    comparisons can be performed on the entire row.
    http://stackoverflow.com/a/16840350/190597 (Jaime, 2013-05)

    Some caveats:
    - `asvoid` will work for integer dtypes, but be careful if using asvoid on float
      dtypes, since float zeros may compare UNEQUALLY:
          >>> asvoid([-0.]) == asvoid([0.])
          array([False], dtype=bool)
    - `asvoid` works best on contiguous arrays. If the input is not contiguous,
      `asvoid` will copy the array to make it contiguous, which will slow down the
      performance.
    """
    arr = np.ascontiguousarray(arr)
    return arr.view(np.dtype((np.void, arr.dtype.itemsize * arr.shape[-1])))

def nodistinctcols(M):
    MT = asvoid(M.T)
    uniqs = np.unique(MT)
    return len(uniqs)

X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
print("nodistinctcols(X.T) {}".format(nodistinctcols(X.T)))
Benchmark:
In [20]: %timeit nodistinctcols(X.T)
10 loops, best of 3: 63.6 ms per loop
In [21]: %timeit nodistinctcols_orig(X.T)
1 loops, best of 3: 17.4 s per loop
where nodistinctcols_orig is defined by:
def nodistinctcols_orig(M):
    setofcols = set()
    for column in M.T:
        setofcols.add(repr(column))
    return len(setofcols)
The sanity check passes:
In [24]: assert nodistinctcols(X.T) == nodistinctcols_orig(X.T)
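By the way, it might make more sense to define

def num_distinct_rows(M):
    return len(np.unique(asvoid(M)))

and simply pass M.T to the function when you want to count the number of distinct columns. That way, if you use it to count distinct rows, the function is not slowed down by an unnecessary transpose.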
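For readers on a newer NumPy, a possible alternative to the void view is the axis argument of np.unique, which was added in NumPy 1.13 and therefore postdates the benchmarks above. The sketch below is not part of the original answer; the function names num_distinct_rows_axis and num_distinct_cols_axis and the demo array are made up for illustration, and no claim is made about how its speed compares to asvoid.

```python
import numpy as np

def num_distinct_rows_axis(M):
    """Count distinct rows via np.unique(axis=0); assumes NumPy >= 1.13."""
    M = np.asarray(M)
    return np.unique(M, axis=0).shape[0]

def num_distinct_cols_axis(M):
    """Count distinct columns via np.unique(axis=1); assumes NumPy >= 1.13."""
    M = np.asarray(M)
    return np.unique(M, axis=1).shape[1]

# Hypothetical demo data: rows 0 and 2 are equal, columns 0 and 2 are equal.
demo = np.array([[0, 1, 0, 1],
                 [1, 1, 1, 0],
                 [0, 1, 0, 1]])

print(num_distinct_rows_axis(demo))
print(num_distinct_cols_axis(demo))
```

Because the axis is chosen explicitly, no transpose is needed to switch between counting rows and counting columns.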
Answer 1: (score: 2)
If you have fewer rows than columns, you can also do multiple stable sorts along the rows and count the unique entries:
def count(x):
    x = x.copy()
    x = x[x[:,0].argsort()]  # first sort can be unstable
    for i in range(1, x.shape[1]):
        x = x[x[:,i].argsort(kind='mergesort')]  # stable sorts now
    # x is now sorted so that equal rows are next to each other
    # -> compare neighbours with each other and count the all-true rows
    return x.shape[0] - np.count_nonzero((x[1:, :] == x[:-1,:]).all(axis=1))
With numpy 1.9.dev this is faster than the void comparison; with older numpy versions the indexing kills the performance (about 4 times slower than void).
X = np.array([np.random.randint(2, size = 16) for i in xrange(2**16)])
In [6]: %timeit count(X)
10 loops, best of 3: 144 ms per loop
Xt = X.T.copy()
In [8]: %timeit unutbu_void(Xt)
10 loops, best of 3: 161 ms per loop
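The repeated stable sorts above amount to a lexicographic sort of the rows, which np.lexsort performs in a single call. The following is a minimal sketch of that variation, not the answerer's code; the name count_rows_lexsort is hypothetical, and it has not been benchmarked against count.

```python
import numpy as np

def count_rows_lexsort(x):
    """Count distinct rows of a 2-D array by sorting rows lexicographically."""
    x = np.asarray(x)
    if x.shape[0] == 0:
        return 0
    # np.lexsort takes a sequence of keys; here each column of x is one key
    # (the last key is the primary one). The result is a row permutation.
    order = np.lexsort(x.T)
    xs = x[order]
    # After sorting, equal rows are adjacent, so a new distinct row starts
    # wherever a row differs from its predecessor in at least one position.
    changed = (xs[1:] != xs[:-1]).any(axis=1)
    return 1 + int(np.count_nonzero(changed))

# Small check against the plain set-of-tuples count.
rng = np.random.RandomState(0)
X_small = rng.randint(2, size=(1000, 4))
print(count_rows_lexsort(X_small) == len(set(map(tuple, X_small))))
```

Which key ends up primary does not matter for counting, since any total lexicographic order places identical rows next to each other.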
Answer 2: (score: 2)
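Just for future reference, don't overlook old-fashioned approaches like using set. Will it be as fast and memory-efficient as a clever numpy approach? No, but it is often good enough for now, which is nothing to sneeze at when you're on the clock.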
In [25]: %time slow = nodistinctcols(X.T)
CPU times: user 28.2 s, sys: 12 ms, total: 28.2 s
Wall time: 28.2 s
In [26]: %time medium = len(set(map(tuple, X)))
CPU times: user 324 ms, sys: 0 ns, total: 324 ms
Wall time: 322 ms
In [27]: slow == medium
Out[27]: True
It is not the set part that is slow, but the string conversion.
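To illustrate that point, the set can be kept while the repr call is swapped for a cheaper hashable key. The sketch below uses the raw bytes of each row via ndarray.tobytes; this is an assumption of mine rather than something from the answer, the name num_distinct_rows_set is hypothetical, and no timing is claimed for it.

```python
import numpy as np

def num_distinct_rows_set(M):
    """Count distinct rows by hashing each row's raw bytes in a Python set."""
    M = np.ascontiguousarray(M)
    # tobytes() returns the row's memory as an immutable bytes object, which
    # is hashable and avoids building a formatted string for every row.
    return len({row.tobytes() for row in M})

# Hypothetical demo data: rows 0 and 2 are equal, columns 0 and 2 are equal.
demo_rows = np.array([[0, 1, 0, 1],
                      [1, 1, 1, 0],
                      [0, 1, 0, 1]])

print(num_distinct_rows_set(demo_rows))    # distinct rows
print(num_distinct_rows_set(demo_rows.T))  # distinct columns, via transpose
```

As with asvoid, comparing raw bytes is safe for integer dtypes, but float values such as -0. and 0. have different byte patterns and would be counted as distinct.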