
Question

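I have a list of unique rows and another, larger array of data (called test_rows in the example). I was wondering whether there is a faster way to get the location of each unique row in the data. The fastest way I could come up with is...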

import numpy


uniq_rows = numpy.array([[0, 1, 0],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]])

test_rows = numpy.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1],
                         [0, 1, 1],
                         [1, 1, 1],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])

# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print(row, numpy.where((test_rows == row).all(axis=1))[0])

This prints...

[0, 1, 0] [ 1  4 10]
[1, 1, 0] [ 3  8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]

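Is there a better or more "numpythonic" (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find one. Basically, for any incoming dataset I need the fastest way to get the locations of each unique row in that dataset. The incoming dataset will not always contain every unique row, or the same number of each.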

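EDIT: This is just a simple example. In my application the numbers would not be just zeros and ones; they could be anywhere from 0 to 32000. The size of uniq_rows could be between 4 and 128 rows, and the size of test_rows could be in the hundreds of thousands.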

Answer 1

numpy

As of numpy version 1.13, you can use numpy.unique, like np.unique(test_rows, return_counts=True, return_index=True, axis=0)
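As a minimal sketch of that call on the question's sample data (assuming NumPy 1.13 or later, and assuming rows are what should be deduplicated, hence axis=0):

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# axis=0 treats every row as one element to deduplicate
rows, first_seen, counts = np.unique(
    test_rows, axis=0, return_index=True, return_counts=True)

print(rows)        # the distinct rows, in sorted order
print(first_seen)  # index of the first occurrence of each distinct row
print(counts)      # how many times each distinct row occurs
```

Note that this reports only the first occurrence and the count of each row, not every position, so a follow-up step (return_inverse, or a lookup per row) is still needed to get all locations.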

pandas

import pandas as pd

df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)

uniq

    0   1   2
0   0   1   0
1   1   1   0
2   1   1   1
3   0   1   1

Or you could generate the unique rows automatically from the incoming DataFrame

uniq_generated = df.drop_duplicates().reset_index(drop=True)

which yields

    0   1   2
0   0   1   1
1   0   1   0
2   0   0   0
3   1   1   0
4   1   1   1

and then look them up

d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values

which does about the same as your where method

d

{0: array([ 1,  4, 10], dtype=int64),
 1: array([ 3,  8, 12], dtype=int64),
 2: array([7, 9], dtype=int64),
 3: array([0, 5, 6], dtype=int64)}
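If the unique rows do not have to be supplied up front, a possible alternative is to let pandas do the grouping itself. The sketch below is not part of the answer above; it assumes that grouping on every column and reading the GroupBy `indices` attribute fits the use case. That attribute is a dict keyed by the row values (as tuples) whose values are arrays of positional indices.

```python
import numpy as np
import pandas as pd

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])
df = pd.DataFrame(test_rows)

# group on every column; .indices maps each distinct row to its positions
positions = df.groupby(list(df.columns)).indices

for key in [(0, 1, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)]:
    # .get avoids a KeyError for rows that never occur in the data
    print(key, positions.get(key, np.array([], dtype=int)))
```

This replaces the explicit loop over uniq with a single grouping pass, at the cost of also grouping rows you may not care about.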

Answer 2

Using np.unique from v1.13 (downloaded from the source link in the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247, and used below as aset)

In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]: 
(array([[0, 0, 0],
        [0, 1, 0],
        [0, 1, 1],
        [1, 1, 0],
        [1, 1, 1]]),
 array([2, 1, 0, 3, 7], dtype=int32),
 array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))

In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
     ...:     dd[v].append(i)
     ...:     
In [167]: dd
Out[167]: 
defaultdict(list,
            {0: [2, 11],
             1: [1, 4, 10],
             2: [0, 5, 6],
             3: [3, 8, 12],
             4: [7, 9]})

Or index the dictionary with the unique rows (as hashable tuples):

In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
     ...:     dd[tuple(a[v])].append(i)
     ...:     
In [172]: dd
Out[172]: 
defaultdict(list,
            {(0, 0, 0): [2, 11],
             (0, 1, 0): [1, 4, 10],
             (0, 1, 1): [0, 5, 6],
             (1, 1, 0): [3, 8, 12],
             (1, 1, 1): [7, 9]})
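The Python-level loop over the inverse array can also be avoided. The following is a sketch, not part of the answer above: it assumes a NumPy with np.unique(..., axis=0), sorts the inverse labels with a stable argsort, and cuts the sorted positions at the group boundaries. The names order, counts and groups are made up for the illustration.

```python
import numpy as np

test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

uniq, inverse = np.unique(test_rows, axis=0, return_inverse=True)
inverse = inverse.ravel()  # guard against releases that return a 2-D inverse

# a stable sort keeps positions within each group in ascending order
order = np.argsort(inverse, kind="stable")
counts = np.bincount(inverse)

# cut the sorted positions where one label ends and the next begins
groups = np.split(order, np.cumsum(counts)[:-1])

for row, idx in zip(uniq.tolist(), groups):
    print(row, idx)
```

Here groups[k] holds the positions of uniq[k], so the result lines up with the dictionary built above without touching each element in Python.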

Answer 3

Approach #1

Here's one approach, though I'm not sure about its level of "NumPythonic-ness" for such a tricky problem -

import numpy as np

def get1Ds(a, b):
    # Get 1D views of each row from the two inputs

    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype

    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))

    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A,B)

    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]

    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1])+1
    all_split_indx = np.split(sidx, split_idx)

    match_mask = np.in1d(B,A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i,e in enumerate(all_split_indx) if valid_mask[i]]

    return uniq_rows[sidx_A[validA_mask]], locations

Scope for improvement (on performance):

np.split could be replaced by a for-loop that does the splitting with slicing.
np.r_ could be replaced by np.concatenate.

Sample run -

In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)

In [332]: unq_rows
Out[332]: 
array([[0, 1, 0],
       [0, 1, 1],
       [1, 1, 0],
       [1, 1, 1]])

In [333]: idx
Out[333]: [array([ 1,  4, 10]),array([0, 5, 6]),array([ 3,  8, 12]),array([7, 9])]

Approach #2

Another approach is to beat the setup overhead of the previous one while reusing get1Ds from it.
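For the two performance notes under Approach #1 (slicing instead of np.split, and np.concatenate instead of np.r_), a rough sketch of what the swaps could look like follows. This is an illustration rather than the answer's own code: sidx and split_idx are hard-coded, hypothetical stand-ins for the arrays computed inside matching_row_indices.

```python
import numpy as np

# hypothetical stand-ins for sidx and split_idx from matching_row_indices
sidx = np.array([2, 11, 1, 4, 10, 0, 5, 6, 3, 8, 12, 7, 9])
split_idx = np.array([2, 5, 8, 11])

# np.concatenate instead of np.r_ to build the group boundaries
bounds = np.concatenate(([0], split_idx, [len(sidx)]))

# plain slicing in a loop instead of np.split
groups = [sidx[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

# np.split produces the same groups
reference = np.split(sidx, split_idx)
print(all(np.array_equal(g, r) for g, r in zip(groups, reference)))
```

Whether either swap pays off depends on the array sizes, so it is worth timing on real data before adopting it.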

Answer 4

This will do the job:

import numpy as np
uniq_rows = np.array([[0, 1, 0],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 1]])

test_rows = np.array([[0, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0],
                         [0, 1, 0],
                         [0, 1, 1],
                         [0, 1, 1],
                         [1, 1, 1],
                         [1, 1, 0],
                         [1, 1, 1],
                         [0, 1, 0],
                         [0, 0, 0],
                         [1, 1, 0]])

indices=np.where(np.sum(np.abs(np.repeat(uniq_rows,len(test_rows),axis=0)-np.tile(test_rows,(len(uniq_rows),1))),axis=1)==0)[0]
loc=indices//len(test_rows)
indices=indices-loc*len(test_rows)
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]

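This works in all cases, including those where not all rows of uniq_rows are present in test_rows. However, if you know beforehand that all of them are present, you can replace the part

res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])

with just one line: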

res=np.split(indices,np.where(np.diff(loc)>0)[0]+1)

thus avoiding the loop entirely.
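To see why the one-line split needs every unique row to be present, and one way around it, here is a small sketch. It is an assumption-laden illustration, not the answer's code: the data is made up so that one unique row is missing, the matches are found with broadcasting rather than repeat/tile, and np.searchsorted is used to locate the boundaries of each group, including empty ones.

```python
import numpy as np

# hypothetical data: the middle row of uniq_rows never occurs in test_rows
uniq_rows = np.array([[0, 1, 0], [5, 5, 5], [1, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [1, 1, 1], [0, 1, 0]])

# loc[k] is the uniq_rows index and indices[k] the test_rows index of match k;
# np.where returns them ordered by loc, so loc is sorted
loc, indices = np.where((uniq_rows[:, None, :] == test_rows).all(axis=2))

# the one-line split drops the empty group, so positions no longer line up
short = np.split(indices, np.where(np.diff(loc) > 0)[0] + 1)
print(len(short))  # fewer groups than unique rows

# boundaries of each uniq_rows index inside the sorted loc array
bounds = np.searchsorted(loc, np.arange(len(uniq_rows) + 1))
res = [indices[bounds[i]:bounds[i + 1]] for i in range(len(uniq_rows))]
print([r.tolist() for r in res])
```

The remaining list comprehension runs once per unique row rather than once per match, which is cheap when uniq_rows has at most a few hundred rows.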

Answer 5

Not very 'numpythonic', but for a bit of upfront cost we can build a dict with the rows (as tuples) as keys and lists of indices as values:

test_rowsdict = {}
for i,j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j),[]).append(i)

test_rowsdict
{(0, 0, 0): [2, 11],
 (0, 1, 0): [1, 4, 10],
 (0, 1, 1): [0, 5, 6],
 (1, 1, 0): [3, 8, 12],
 (1, 1, 1): [7, 9]}

You can then filter based on your uniq_rows with a fast dict lookup, test_rowsdict[tuple(row)]:

out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i),[])))

For your data, I get 16 µs for the lookup alone and 66 µs for building plus lookup, versus 95 µs for the np.where solution.
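A possible variation on the dictionary idea, shown as a self-contained sketch: convert the arrays once with tolist() so that the keys are tuples of plain Python ints. The assumption that this lowers the per-row overhead compared with iterating over the array directly is not measured here; lookup and out are names chosen for the example.

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# build the row -> positions table in a single pass over the data
lookup = {}
for i, row in enumerate(test_rows.tolist()):
    lookup.setdefault(tuple(row), []).append(i)

# rows of uniq_rows that never occur simply map to an empty list
out = [(row, lookup.get(tuple(row), [])) for row in uniq_rows.tolist()]
print(out)
```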

Answer 6

The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:

import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)

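If you do not care about the indices of all rows, but only about those of the rows present in test_rows, npi has a bunch of simple ways of tackling that too; for instance: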

subset_indices = npi.indices(unique_test_rows, uniq_rows)

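As a side note: it may be useful to take a look at the examples in the npi library. In my experience, most of the time people ask this kind of question, these grouped indices are just a means to an end, not the end goal of the computation. Chances are that with the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Would you care to give some more background on your problem?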

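EDIT: If your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them with the following encoding might boost efficiency further still: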

def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
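As a rough illustration of how such an encoding could be used, the sketch below packs each binary row into one integer and then compares 1-D codes instead of whole rows. It deviates from the function above in one respect, which is an assumption of this example: it uses a dot product with the default integer type instead of summing into np.uint8, so it is not limited to 8 columns. It is only valid for rows containing 0 and 1.

```python
import numpy as np

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

def encode(rows):
    # pack each binary row into one integer: column i contributes 2**i
    weights = 2 ** np.arange(rows.shape[1])
    return rows.dot(weights)

test_codes = encode(test_rows)

# one cheap 1-D comparison per unique row instead of a row-wise comparison
positions = {tuple(row): np.flatnonzero(test_codes == code)
             for row, code in zip(uniq_rows.tolist(), encode(uniq_rows))}
print(positions)
```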

Answer 7

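There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to become an issue if large arrays are used.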

np.where((uniq_rows[:, None, :] == test_rows).all(2))

Pretty simple, eh? This returns a tuple of the unique row indices and the corresponding test row indices.

 (array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
  array([ 1,  4, 10,  3,  8, 12,  7,  9,  0,  5,  6]))

How it works:

(uniq_rows[:, None, :] == test_rows)

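uses array broadcasting to compare each row of test_rows with each row of uniq_rows. This results in a 4x13x3 array. all is then used to determine which rows are equal (all comparisons returned true). Finally, where returns the indices of those rows.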
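If the intermediate boolean array gets too large for hundreds of thousands of test rows, one option is to broadcast over test_rows in chunks. The following is a sketch under that assumption; match_positions and the chunk parameter are hypothetical names introduced for this example.

```python
import numpy as np

def match_positions(uniq_rows, test_rows, chunk=50_000):
    """Return, for each row of uniq_rows, its positions in test_rows."""
    hits = [[] for _ in range(len(uniq_rows))]
    for start in range(0, len(test_rows), chunk):
        block = test_rows[start:start + chunk]
        # boolean array of shape (len(uniq_rows), len(block), n_columns)
        u_idx, t_idx = np.where((uniq_rows[:, None, :] == block).all(axis=2))
        for u in range(len(uniq_rows)):
            # shift block-local positions back to positions in test_rows
            hits[u].append(t_idx[u_idx == u] + start)
    return [np.concatenate(h) if h else np.empty(0, dtype=np.intp)
            for h in hits]

uniq_rows = np.array([[0, 1, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1]])
test_rows = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0],
                      [0, 1, 0], [0, 1, 1], [0, 1, 1], [1, 1, 1],
                      [1, 1, 0], [1, 1, 1], [0, 1, 0], [0, 0, 0],
                      [1, 1, 0]])

# a tiny chunk size just to exercise the chunking on the sample data
result = match_positions(uniq_rows, test_rows, chunk=5)
print([r.tolist() for r in result])
```

The chunk size trades peak memory against the number of passes; the temporary array holds roughly len(uniq_rows) * chunk * n_columns booleans at a time.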

What is a faster way to get the location of unique rows in numpy
