Question

For example, given

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

I want

[2, 2, 3]

Is there a way to do this without a for loop or np.vectorize?

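Edit: The actual data consists of 1000 rows of 100 elements each, with each element ranging from 1 to 365. The end goal is to determine the percentage of rows that contain duplicates. This is a homework problem that I have already solved (with a for loop), but I am wondering whether there is a better way to do it with numpy.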

Answer 1

Approach #1

A vectorized approach based on sorting -

In [8]: b = np.sort(a,axis=1)

In [9]: (b[:,1:] != b[:,:-1]).sum(axis=1)+1
Out[9]: array([2, 2, 3])
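
Applied to the end goal stated in the question (the percentage of rows that contain duplicates), the sorting approach might be used as in the sketch below. The function name rows_with_duplicates_pct and the randomly generated data are assumptions made for illustration; they do not come from the original posts.

```python
import numpy as np

def rows_with_duplicates_pct(a):
    """Percentage of rows of a 2-D array that contain a repeated value."""
    b = np.sort(a, axis=1)
    # Distinct values per row: one more than the number of places
    # where neighbouring entries of the sorted row differ.
    n_unique = (b[:, 1:] != b[:, :-1]).sum(axis=1) + 1
    # A row has a duplicate exactly when it has fewer distinct values than columns.
    has_dup = n_unique < a.shape[1]
    return 100.0 * has_dup.mean()

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])
pct = rows_with_duplicates_pct(a)  # two of the three rows contain a duplicate

# Hypothetical data shaped like the question's: 1000 rows, 100 values in 1..365.
rng = np.random.default_rng(0)
data = rng.integers(1, 366, size=(1000, 100))
pct_data = rows_with_duplicates_pct(data)
```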

Approach #2

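Another approach, for non-negative ints that aren't very large, is to shift each row by its own offset so that its elements cannot collide with those of any other row, then do a binned count over the flattened array and count the non-zero bins per row -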

n = a.max()+1
a_off = a+(np.arange(a.shape[0])[:,None])*n
M = a.shape[0]*n
out = (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
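
To make the offset trick easier to follow, here is the same computation on the 3x3 array from the question with the intermediate arrays written out. The variable name counts is introduced only for this walkthrough, and the sketch assumes non-negative integer input, which np.bincount requires.

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

n = a.max() + 1                                  # 5 possible values: 0..4
a_off = a + np.arange(a.shape[0])[:, None] * n   # row i is shifted by i*n
# a_off is [[1, 0, 0], [6, 5, 5], [12, 13, 14]]: no value is shared between rows.

counts = np.bincount(a_off.ravel(), minlength=a.shape[0] * n).reshape(-1, n)
# counts[i, v] is how many times the value v occurs in row i:
# [[2, 1, 0, 0, 0],
#  [2, 1, 0, 0, 0],
#  [0, 0, 1, 1, 1]]

out = (counts != 0).sum(axis=1)                  # array([2, 2, 3])
```

The bin array has one entry per (row, possible value) pair, so memory grows with the number of rows times a.max()+1; that is why this approach suits small value ranges.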

Runtime test

Approaches as funcs -

import numpy as np
import pandas as pd
from toolz import compose

def sorting(a):
    b = np.sort(a,axis=1)
    return (b[:,1:] != b[:,:-1]).sum(axis=1)+1

def bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[0])[:,None])*n
    M = a.shape[0]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)

# From @wim's post   
def pandas(a):
    df = pd.DataFrame(a.T)
    return df.nunique()

# @jp_data_analysis's soln
def numpy_apply(a):
    return np.apply_along_axis(compose(len, np.unique), 1, a)

Case #1: Square array

In [164]: np.random.seed(0)

In [165]: a = np.random.randint(0,5,(10000,10000))

In [166]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 1.82 s per loop
1 loop, best of 3: 1.93 s per loop
1 loop, best of 3: 354 ms per loop
1 loop, best of 3: 879 ms per loop

Case #2: Large number of rows

In [167]: np.random.seed(0)

In [168]: a = np.random.randint(0,5,(1000000,10))

In [169]: %timeit numpy_apply(a)
     ...: %timeit sorting(a)
     ...: %timeit bincount(a)
     ...: %timeit pandas(a)
1 loop, best of 3: 8.42 s per loop
10 loops, best of 3: 153 ms per loop
10 loops, best of 3: 66.8 ms per loop
1 loop, best of 3: 53.6 s per loop

Extending to the number of unique elements per column

To extend, we just need to do the slicing and ufunc operations along the other axis, like so -

def nunique_percol_sort(a):
    b = np.sort(a,axis=0)
    return (b[1:] != b[:-1]).sum(axis=0)+1

def nunique_percol_bincount(a):
    n = a.max()+1
    a_off = a+(np.arange(a.shape[1]))*n
    M = a.shape[1]*n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1,n)!=0).sum(1)
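
As a quick sanity check, the two per-column functions can be compared against a plain loop of np.unique calls. This is only a sketch on the question's sample array; the name expected is introduced here for illustration.

```python
import numpy as np

def nunique_percol_sort(a):
    b = np.sort(a, axis=0)
    return (b[1:] != b[:-1]).sum(axis=0) + 1

def nunique_percol_bincount(a):
    n = a.max() + 1
    a_off = a + np.arange(a.shape[1]) * n   # column j is shifted by j*n
    M = a.shape[1] * n
    return (np.bincount(a_off.ravel(), minlength=M).reshape(-1, n) != 0).sum(1)

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

# Reference result: one np.unique call per column.
# The columns hold the value sets {1, 2}, {0, 3} and {0, 4}.
expected = np.array([len(np.unique(col)) for col in a.T])

by_sort = nunique_percol_sort(a)
by_bincount = nunique_percol_bincount(a)
```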

Answer 2

import numpy as np
from toolz import compose

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

np.apply_along_axis(compose(len, np.unique), 1, a)    # [2, 2, 3]
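
If adding toolz as a dependency is not desirable, a plain lambda should give the same result, since compose(len, np.unique) simply calls np.unique and then len. A minimal sketch:

```python
import numpy as np

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

# Same idea without toolz: apply len(np.unique(row)) to each row (axis 1).
counts = np.apply_along_axis(lambda row: len(np.unique(row)), 1, a)
```

Either way, np.apply_along_axis still calls the Python function once per row, so it is a convenience rather than true vectorization.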

Answer 3

Are you open to considering pandas? DataFrame has a dedicated method for this, nunique.
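
A minimal sketch of the pandas route, based on the df.nunique() call quoted in the runtime test of Answer 1. The conversion with to_numpy() and the axis=1 variant are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

a = np.array([[1, 0, 0], [1, 0, 0], [2, 3, 4]])

# DataFrame.nunique() counts distinct values per column by default,
# so transpose first to turn each row of `a` into a column.
df = pd.DataFrame(a.T)
counts = df.nunique().to_numpy()

# Alternatively, keep the original orientation and count along axis=1.
counts_axis1 = pd.DataFrame(a).nunique(axis=1).to_numpy()
```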


Answer 4

A one-liner using sort:

In [6]: np.count_nonzero(np.diff(np.sort(a)), axis=1)+1
Out[6]: array([2, 2, 3])

Number of unique elements per row in a NumPy array
