Question

I have a large array that looks something like this:

import numpy as np

np.random.seed(42)
arr = np.random.permutation(np.array([
    (1,1,2,2,2,2,3,3,4,4,4),
    (8,9,3,4,7,9,1,9,3,4,50000)
]).T)

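It is not sorted, its rows are unique, and I know the bounds for the values in both columns: [0, n] and [0, k]. The maximum possible size of the array is therefore (n+1)*(k+1), but the actual size is closer to the logarithm of that.

I need to search the array by both columns to find the row such that arr[row,:] = (i,j), and return -1 when (i,j) is absent from the array. A naive implementation of such a function is:

def get(arr, i, j):
    cond = (arr[:,0] == i) & (arr[:,1] == j)
    if np.any(cond):
        return np.where(cond)[0][0]
    else:
        return -1

Unfortunately, since arr is very large in my case (> 90 million rows), this is very inefficient, especially because I need to call get() many times.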

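Alternatively, I tried translating it into a dict with (i,j) keys, such that

index[(i,j)] = row

which can be accessed by:

def get(index, i, j):
    try:
        return index[(i,j)]
    except KeyError:
        return -1

This works (and is much faster when tested on data smaller than mine), but creating the dict on the fly by

index = {}
for row in range(arr.shape[0]):
    i,j = arr[row, :]
    index[(i,j)] = row

takes a huge amount of time and eats a lot of RAM in my case. I was also thinking of sorting arr first and then using something like np.searchsorted, but that did not get me anywhere.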

So what I need is a fast function get(arr, i, j) that returns the row index of (i,j), or -1 if (i,j) is not in the array.
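For reference, the sort-then-np.searchsorted idea mentioned in the question can be sketched roughly as follows. This is only an illustration, not a tested solution for the 90M-row case. It assumes that j always lies in [0, k], that the combined key fits in int64, and that two extra arrays of the same length as arr fit in memory. The helper names build_sorted_keys and get_sorted are made up for this sketch.

```python
import numpy as np

def build_sorted_keys(arr, k):
    # Encode each (i, j) row as one integer key; keys are unique because 0 <= j <= k.
    keys = arr[:, 0].astype(np.int64) * (k + 1) + arr[:, 1]
    # order[p] is the original row index of the p-th smallest key.
    order = np.argsort(keys, kind="stable")
    return keys[order], order

def get_sorted(sorted_keys, order, k, i, j):
    # Assumes 0 <= j <= k; otherwise different (i, j) pairs could share a key.
    key = i * (k + 1) + j
    pos = np.searchsorted(sorted_keys, key)  # binary search, O(log N)
    if pos < sorted_keys.size and sorted_keys[pos] == key:
        return int(order[pos])
    return -1

# The sample array as printed in Answer 1.
arr = np.array([[2, 9], [1, 8], [4, 4], [4, 50000], [2, 3], [1, 9],
                [4, 3], [2, 7], [3, 9], [2, 4], [3, 1]])
k = 50000
sorted_keys, order = build_sorted_keys(arr, k)
print(get_sorted(sorted_keys, order, k, 2, 3))    # 4
print(get_sorted(sorted_keys, order, k, 4, 100))  # -1
```

The sort is paid once; each lookup afterwards is a binary search rather than a full scan.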

Answer 1

A partial solution would be:

In [36]: arr
Out[36]: 
array([[    2,     9],
       [    1,     8],
       [    4,     4],
       [    4, 50000],
       [    2,     3],
       [    1,     9],
       [    4,     3],
       [    2,     7],
       [    3,     9],
       [    2,     4],
       [    3,     1]])

In [37]: (i,j) = (2, 3)

# we can use `assume_unique=True` which can speed up the calculation    
In [38]: np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)
Out[38]: 
array([[False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False]])

# we can use `assume_unique=True` which can speed up the calculation
In [39]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)

In [40]: np.argwhere(mask)
Out[40]: array([[4, 0]])

If you need the final result as a scalar, drop the keepdims argument and convert the array to a scalar, e.g.:

# we can use `assume_unique=True` which can speed up the calculation
In [41]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)

In [42]: np.argwhere(mask)
Out[42]: array([[4]])

In [43]: np.argwhere(mask).item()  # np.asscalar was removed in NumPy 1.23
Out[43]: 4
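One caveat worth checking before relying on this: np.isin tests each element for membership in the set {i, j} and ignores which column the element sits in, so rows such as (j, i), (i, i) or (j, j) would also pass the mask. The small array below is made up for illustration; the position-aware comparison is shown as one possible alternative, not as the answer author's method.

```python
import numpy as np

# Hypothetical array chosen so that the difference is visible.
arr = np.array([[2, 9], [2, 3], [3, 2], [3, 3]])
i, j = 2, 3

# Membership test: True for any row whose two values both come from {i, j}.
loose = np.all(np.isin(arr, [i, j]), axis=1)

# Position-aware test: column 0 must equal i and column 1 must equal j.
exact = np.all(arr == np.array([i, j]), axis=1)

print(np.flatnonzero(loose))  # [1 2 3]
print(np.flatnonzero(exact))  # [1]
```

On the sample array from the question the two masks happen to agree for (2, 3), because no row (3, 2), (2, 2) or (3, 3) is present.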

Answer 2

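Solution

Python offers a set type to store unique values, but unfortunately no ordered version of a set. You can, however, use the ordered-set package.

Create an OrderedSet from the data. Fortunately, this only needs to be done once: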

import ordered_set

o = ordered_set.OrderedSet(map(tuple, arr))

def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1

Runtime

According to the documentation, looking up the index of a value should be O(1):

In [46]: %timeit get(arr, 2, 3)
10.6 µs ± 39 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [47]: %timeit ordered_get(o, 2, 3)
1.16 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [48]: %timeit ordered_get(o, 2, 300)
1.05 µs ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Testing against a larger array:

a2 = np.random.randint(10000, size=1000000).reshape(-1,2)
o2 = ordered_set.OrderedSet()
for t in map(tuple, a2):
    o2.add(t)

In [65]: %timeit get(a2, 2, 3)
1.05 ms ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [66]: %timeit ordered_get(o2, 2, 3)
1.03 µs ± 2.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [67]: %timeit ordered_get(o2, 2, 30000)
1.06 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

It does indeed look like O(1) runtime.

Answer 3

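def get_agn(arr, i, j):
    # compare column 0 with i and column 1 with j
    idx = np.flatnonzero((arr[:,0] == i) & (arr[:,1] == j))
    return -1 if idx.size == 0 else idx[0]

Also, in case you are considering the ordered_set solution, here is a better one (in both cases, however, see the timing tests below):

d = { (i, j): k for k, (i, j) in enumerate(arr)}
def unordered_get(d, i, j):
    return d.get((i, j), -1)

and its "full" equivalent (which builds the dictionary inside the function):

def unordered_get_full(arr, i, j):
    d = { (i, j): k for k, (i, j) in enumerate(arr)}
    return d.get((i, j), -1)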

Timing tests:

First, define @kmario23's function:

def get_kmario23(arr, i, j):
    # fundamentally, kmario23's code rearranged to return scalars
    # and -1 when (i, j) is not found:
    mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)
    idx = np.argwhere(mask)
    return -1 if idx.size == 0 else int(idx[0, 0])

Second, define @ChristophTerasa's functions (original and full version):

import ordered_set
o = ordered_set.OrderedSet(map(tuple, arr))
def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1

def ordered_get_full(arr, i, j):
    # "Full" version that builds ordered set inside the function
    o = ordered_set.OrderedSet(map(tuple, arr))
    try:
        return o.index((i,j))
    except KeyError:
        return -1

Generate some large data:

arr = np.random.randint(1, 2000, 200000).reshape((-1, 2))

Timing results:

In [55]: %timeit get_agn(arr, *arr[-1])
149 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [56]: %timeit get_kmario23(arr, *arr[-1])
1.42 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [57]: %timeit get_kmario23(arr, *arr[0])
1.2 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Ordered set tests:

In [80]: o = ordered_set.OrderedSet(map(tuple, arr))

In [81]: %timeit ordered_get(o, *arr[-1])
1.74 µs ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [82]: %timeit ordered_get_full(arr, *arr[-1]) # include ordered set creation time
166 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Unordered dictionary tests:

In [83]: d = { (i, j): k for k, (i, j) in enumerate(arr)}

In [84]: %timeit unordered_get(d, *arr[-1])
1.18 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [85]: %timeit unordered_get_full(arr, *arr[-1])
102 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, when the time needed to create either the ordered set or the unordered dictionary is taken into account, these methods are quite slow. You must plan on running several hundred searches on the same data for them to make sense. Even then, there is no need for the ordered_set package: regular dictionaries are faster.
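The "several hundred searches" estimate can be checked with a quick back-of-the-envelope calculation from the timings reported in this answer (149 µs per get_agn scan, about 102 ms to build the dict, about 166 ms to build the ordered set). The function name break_even_lookups is made up for this sketch, and the result only applies to the 100,000-row test array timed above; it says nothing definite about other sizes.

```python
def break_even_lookups(build_s, indexed_lookup_s, scan_lookup_s):
    """Lookups needed before build-once-then-index beats scanning every time."""
    return build_s / (scan_lookup_s - indexed_lookup_s)

scan = 149e-6  # get_agn, seconds per lookup

# Build costs are approximated by the "full" timings, which include one lookup.
dict_n = break_even_lookups(102e-3, 1.18e-6, scan)
oset_n = break_even_lookups(166e-3, 1.74e-6, scan)

print(round(dict_n))  # about 690 lookups for the plain dict
print(round(oset_n))  # about 1127 lookups for the ordered set
```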

Answer 4

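It seems I was overthinking this problem; there is a simple solution. I was considering either filtering and subsetting the array or using a dict with index[(i,j)] = row. Filtering and subsetting was slow (O(n) per search), while the dict was fast (O(1) access time), but creating the dict was slow and memory-intensive.

The simple solution to this problem is to use nested dicts.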

index = {}

for row in range(arr.shape[0]):
    i,j = arr[row, :]
    try:
        index[i][j] = row
    except KeyError:
        index[i] = {}
        index[i][j] = row

def get(index, i, j):
    try:
        return index[i][j]
    except KeyError:
        return -1

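Alternatively, instead of a plain dict at the upper level, I could use index = defaultdict(dict), which would allow assigning index[i][j] = row directly, without the try ... except block. However, the defaultdict(dict) object would then create an empty {} whenever get(index, i, j) queries a nonexistent i, so it would expand index unnecessarily.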
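The side effect described above is easy to reproduce. The snippet below is a minimal sketch: it shows the defaultdict growing after a failed lookup, and one possible way to avoid that by chaining dict.get calls. The name get_no_insert is made up for this example.

```python
from collections import defaultdict

index = defaultdict(dict)
index[2][3] = 4  # no try/except needed when building

def get(index, i, j):
    try:
        return index[i][j]
    except KeyError:
        return -1

size_before = len(index)
missing = get(index, 7, 1)   # index[7] silently creates an empty dict first
size_after = len(index)
print(missing, size_before, size_after)  # -1 1 2

def get_no_insert(index, i, j):
    # dict.get does not trigger the default factory, so index is left unchanged.
    return index.get(i, {}).get(j, -1)

print(get_no_insert(index, 9, 1), len(index))  # -1 2
```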

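The access time is O(1) for the upper-level dict and O(1) for the nested dicts, so overall it is O(1). The upper-level dict has a manageable size (bounded by n), and building the nested dicts is also very fast, even for >90M rows. Moreover, it can easily be extended to more complicated cases.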

Search large array by two columns
