2D numpy array search (equivalent to Matlab's intersect 'rows' option)

Asked: 2014-06-10 15:20:21

Tags: python arrays numpy

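I have two 4-column 2D numpy arrays (cap and usp), each with several hundred rows of floats. Consider a subset of 3 columns in each array (e.g. capind=cap[:,:3]):

  1. There are many common rows between the two arrays.
  2. Each row tuple/"triplet" is unique within each array.
  3. I am looking for an efficient way to identify these common three-value (row) subsets in the two arrays while retaining the 4th column from both arrays for further processing. In essence, I am looking for a good way to do the equivalent of Matlab's intersect function with the 'rows' option (i.e. [c, ia, ib]=intersect(capind, uspind, 'rows');).

    This returns the indices of the matching rows, so that getting the matching triplets together with the 4th-column values from the original arrays (matchcap=cap[ia,:]) is trivial.


My current approach is based on a similar question on the forum, since I could not find one that matches my problem exactly. However, given my goal, this approach seems somewhat inefficient (and I still have not fully solved my problem):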

The arrays look like this:

    cap=array([[  2.50000000e+01,   1.27000000e+02,   1.00000000e+00,
          9.81997200e-06],
       [  2.60000000e+01,   1.27000000e+02,   1.00000000e+00,
          9.14296800e+00],
       [  2.70000000e+01,   1.27000000e+02,   1.00000000e+00,
          2.30137100e-04],
       ...,
       [  6.10000000e+01,   1.80000000e+02,   1.06000000e+02,
          8.44939900e-03],
       [  6.20000000e+01,   1.80000000e+02,   1.06000000e+02,
          4.77729100e-03],
       [  6.30000000e+01,   1.80000000e+02,   1.06000000e+02,
          1.40343500e-03]])
    
    usp=array([[  4.10000000e+01,   1.31000000e+02,   1.00000000e+00,
          5.24197200e-06],
       [  4.20000000e+01,   1.31000000e+02,   1.00000000e+00,
          8.39178800e-04],
       [  4.30000000e+01,   1.31000000e+02,   1.00000000e+00,
          1.20279900e+01],
       ...,
       [  4.70000000e+01,   1.80000000e+02,   1.06000000e+02,
          2.48667700e-02],
       [  4.80000000e+01,   1.80000000e+02,   1.06000000e+02,
          4.23304600e-03],
       [  4.90000000e+01,   1.80000000e+02,   1.06000000e+02,
          1.02051300e-03]])
    

I then convert each 4-column array (usp and cap) into a three-column array (capind and uspind, shown below as integers for readability).

    capind=array([[ 25, 127,   1],
       [ 26, 127,   1],
       [ 27, 127,   1],
       ...,
       [ 61, 180, 106],
       [ 62, 180, 106],
       [ 63, 180, 106]])
    uspind=array([[ 41, 131,   1],
       [ 42, 131,   1],
       [ 43, 131,   1],
       ...,
       [ 47, 180, 106],
       [ 48, 180, 106],
       [ 49, 180, 106]])
    

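Using a set operation gives me the matching triplets: carray=np.array([x for x in set(tuple(x) for x in capind) & set(tuple(x) for x in uspind)])

This seems to work reasonably well for finding the common row values of the uspind and capind arrays. I now need to get the 4th-column values from the matching rows (i.e. compare carray with the first three columns of the original source arrays, cap and usp, and somehow grab the values from the 4th column).

Is there a better, more efficient way to achieve this? Otherwise, any help on the best way to retrieve the 4th-column values from the source arrays would be much appreciated.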

4 answers:

Answer 0: (score: 2)

Try using dictionaries.

capind = {tuple(row[:3]): row[3] for row in cap}
uspind = {tuple(row[:3]): row[3] for row in usp}

# On Python 2.7, use capind.viewkeys() & uspind.viewkeys() instead
keys = capind.keys() & uspind.keys()
for key in keys:
    # capind[key] and uspind[key] are the fourth columns
    print(key, capind[key], uspind[key])
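
If the row indices are wanted as well (the ia and ib that Matlab returns), the same dictionary idea can map each triplet to its row number instead of to the fourth column. This is only a minimal sketch on made-up toy data; the names cap_rows, usp_rows and common are hypothetical, and sorting the keys is my own addition to get a reproducible order.

```python
import numpy as np

# Hypothetical toy data: columns 0-2 form the key, column 3 is the value.
cap = np.array([[25., 127., 1., 0.5],
                [26., 127., 1., 1.5],
                [27., 127., 1., 2.5]])
usp = np.array([[27., 127., 1., 9.0],
                [25., 127., 1., 7.0],
                [99., 131., 1., 8.0]])

# Map each key triplet to its row number rather than to the fourth column.
cap_rows = {tuple(row[:3]): i for i, row in enumerate(cap)}
usp_rows = {tuple(row[:3]): i for i, row in enumerate(usp)}

# Shared keys, sorted so that the result order is reproducible.
common = sorted(cap_rows.keys() & usp_rows.keys())
ia = np.array([cap_rows[k] for k in common], dtype=int)
ib = np.array([usp_rows[k] for k in common], dtype=int)

matchcap = cap[ia, :]   # matching rows of cap, fourth column included
matchusp = usp[ib, :]   # the same triplets as they appear in usp
```

This still loops at the Python level, which should be acceptable for a few hundred rows.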

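Answer 1: (score: 2)

Assuming the rows are unique within each matrix and that common rows exist, here is one solution. The basic idea is to concatenate the two arrays, sort the result so that identical rows end up next to each other, and then take the difference between consecutive rows. If two rows are the same, the differences of the first three values should be zero (or close to zero).

[Original]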

import numpy as np

## Concatenate the matrices together
cu = np.concatenate((cap, usp), axis=0)
print(cu)

## Sort whole rows by the first three columns
##  (cu.sort(axis=0) would sort each column independently and scramble the rows)
cu = cu[np.lexsort((cu[:, 2], cu[:, 1], cu[:, 0]))]
print(cu)

## Do a forward difference from row to row
cu_diff = np.diff(cu, n=1, axis=0)

## Now calculate the sum of the first three columns
##  as it should be zero (or near zero)
cu_diff_s = np.sum(np.abs(cu_diff[:, :-1]), axis=1)

## Find the indices where it is zero
##  Change this to be <= eps if you are using float numbers
indices = np.flatnonzero(cu_diff_s == 0)
print(indices)

## And here are the rows...
print(cu[indices, :])

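I built a dataset based on your example above, and it seems to work. There may be faster ways, but this way you don't have to loop over anything. (I don't like loops :-)).

[Update]

OK. So I added two columns to each matrix. The first added column is a code, 1 for cap and 2 for usp. The last column is simply the index into the original matrix.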

import numpy as np

## Store more info in the array
##  The first 4 columns are the initial data
##  The fifth column is a code of 1 or 2 (ie cap or usp)
##  The sixth column is the index into the original matrix

cap_code = np.column_stack((np.ones(len(cap)), np.arange(len(cap))))
cap_info = np.concatenate((cap, cap_code), axis=1)

usp_code = np.column_stack((2 * np.ones(len(usp)), np.arange(len(usp))))
usp_info = np.concatenate((usp, usp_code), axis=1)

## Concatenate the matrices together
cu = np.concatenate((cap_info, usp_info), axis=0)
print(cu)

## Sort whole rows by the first three columns
##  (cu.sort(axis=0) would sort each column independently and scramble the rows)
cu = cu[np.lexsort((cu[:, 2], cu[:, 1], cu[:, 0]))]
print(cu)

## Do a forward difference from row to row
cu_diff = np.diff(cu, n=1, axis=0)

## Now calculate the sum of the first three columns
##  as it should be zero (or near zero)
cu_diff_s = np.sum(np.abs(cu_diff[:, :3]), axis=1)

## Find the indices where it is zero
##  Change this to be <= eps if you are using float numbers
indices = np.flatnonzero(cu_diff_s == 0)
print(indices)

## And here are the rows: each match is a pair of adjacent rows
print(cu[indices, :])
print(cu[indices + 1, :])

It seems to work on my contrived data. It is getting a bit convoluted, so I don't think I want to pursue this direction any further.
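
For what it's worth, the concatenate, sort and diff idea can be tried on a tiny made-up dataset, keeping the bookkeeping in separate arrays rather than in extra columns. This is a sketch under my own assumptions: the names keys, source and rownum are hypothetical, and the rows are ordered with np.lexsort so that whole rows stay together (sorting with axis=0 orders each column independently).

```python
import numpy as np

# Hypothetical toy data: columns 0-2 form the key, column 3 is the value.
cap = np.array([[25., 127., 1., 0.5],
                [26., 127., 1., 1.5],
                [27., 127., 1., 2.5]])
usp = np.array([[27., 127., 1., 9.0],
                [25., 127., 1., 7.0],
                [99., 131., 1., 8.0]])

# Stack the key columns and remember where every row came from.
keys = np.concatenate((cap[:, :3], usp[:, :3]), axis=0)
source = np.concatenate((np.zeros(len(cap), dtype=int),
                         np.ones(len(usp), dtype=int)))   # 0 = cap, 1 = usp
rownum = np.concatenate((np.arange(len(cap)), np.arange(len(usp))))

# np.lexsort treats the LAST key as the primary one: sort by column 0,
# then column 1, then column 2, and break ties so cap rows come before usp rows.
order = np.lexsort((source, keys[:, 2], keys[:, 1], keys[:, 0]))
keys, source, rownum = keys[order], source[order], rownum[order]

# Adjacent rows with identical keys are the matches
# (for inexact floats, np.isclose could replace the == 0 test).
same = np.all(np.diff(keys, axis=0) == 0, axis=1)
first = np.flatnonzero(same)

ia = rownum[first]       # row numbers in cap
ib = rownum[first + 1]   # row numbers of the same triplets in usp
```

This relies on the triplets being unique within each array, so an equal adjacent pair always consists of one cap row followed by one usp row.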

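Good luck!

Answer 2: (score: 0)

The numpy equivalent of the Matlab call, returning row indices, is shown below. It returns a boolean array that is True at the indices of the rows of arr that also appear in rows: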

def find_rows_in_array(arr, rows):
    '''
    find indices of rows in array if they exist
    '''
    tmp = np.prod(np.swapaxes(
        arr[:, :, None], 1, 2) == rows, axis=2)
    return np.sum(np.cumsum(tmp, axis=0) * tmp == 1,
                  axis=1) > 0

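The above returns the indices of non-duplicated rows only (the first occurrence of each matching row). If you want every matching row returned, then: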

def find_rows_in_array(arr, rows):
    '''
    find indices of rows in array if they exist
    '''
    tmp = np.prod(np.swapaxes(
        arr[:, :, None], 1, 2) == rows, axis=2)
    return np.sum(tmp,
                  axis=1) > 0

This is much faster. You can swap the two arrays as inputs to find the corresponding indices for each array. Enjoy :D
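
As a rough illustration of how such a broadcast comparison can yield both index arrays at once, here is a small variant on made-up toy data. It uses np.all and np.nonzero instead of the prod/sum combination above, which is my own substitution, and the name eq is hypothetical.

```python
import numpy as np

# Hypothetical toy data: columns 0-2 form the key, column 3 is the value.
cap = np.array([[25., 127., 1., 0.5],
                [26., 127., 1., 1.5],
                [27., 127., 1., 2.5]])
usp = np.array([[27., 127., 1., 9.0],
                [25., 127., 1., 7.0],
                [99., 131., 1., 8.0]])

# Compare every key row of cap with every key row of usp.
# eq has shape (len(cap), len(usp)); eq[i, j] is True when the triplets match.
eq = np.all(cap[:, None, :3] == usp[None, :, :3], axis=2)

# Each True entry is one matching pair (cap row, usp row).
ia, ib = np.nonzero(eq)

matchcap = cap[ia, :]   # matching rows of cap with their fourth column
matchusp = usp[ib, :]   # the same triplets as they appear in usp
```

The intermediate comparison holds len(cap) * len(usp) * 3 booleans, so this is fine for a few hundred rows but grows quickly for large arrays.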

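Answer 3: (score: 0)

The numpy_indexed package (disclaimer: I am its author) contains all the functionality you need, implemented efficiently (i.e. fully vectorized, so no slow loops at the Python level):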

import numpy_indexed as npi
c = npi.intersection(capind, uspind)
ia = npi.indices(capind, c)
ib = npi.indices(uspind, c)

Depending on how much you care about brevity versus performance, you may prefer:

import numpy_indexed as npi
a = npi.as_index(capind)
b = npi.as_index(uspind)
c = npi.intersection(a, b)
ia = npi.indices(a, c)
ib = npi.indices(b, c)
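
If installing an extra package is not an option, something similar can probably be sketched with plain NumPy through np.unique with axis=0. This is not how numpy_indexed works internally; it is an alternative under the question's assumption that triplets are unique within each array, shown on made-up toy data with hypothetical helper names (stacked, pos_cap, pos_usp).

```python
import numpy as np

# Hypothetical toy data: columns 0-2 form the key, column 3 is the value.
cap = np.array([[25., 127., 1., 0.5],
                [26., 127., 1., 1.5],
                [27., 127., 1., 2.5]])
usp = np.array([[27., 127., 1., 9.0],
                [25., 127., 1., 7.0],
                [99., 131., 1., 8.0]])

# Unique key rows of the stacked arrays, plus a label per input row and a count.
stacked = np.concatenate((cap[:, :3], usp[:, :3]), axis=0)
uniq, inverse, counts = np.unique(stacked, axis=0,
                                  return_inverse=True, return_counts=True)
inverse = inverse.ravel()   # guard against shape differences across NumPy versions

# Triplets are unique within each array, so a count of 2 means "present in both".
common = np.flatnonzero(counts == 2)

# For every unique row, record its row number in cap and in usp (-1 if absent).
pos_cap = np.full(len(uniq), -1)
pos_usp = np.full(len(uniq), -1)
pos_cap[inverse[:len(cap)]] = np.arange(len(cap))
pos_usp[inverse[len(cap):]] = np.arange(len(usp))

c = uniq[common]       # common triplets, in sorted row order
ia = pos_cap[common]   # their row numbers in cap
ib = pos_usp[common]   # their row numbers in usp
```

Here cap[ia, 3] and usp[ib, 3] give the fourth-column values of the matching rows.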