Remove rows from a numpy array based on presence/absence in other arrays

Asked: 2016-01-22 17:24:52

Tags: python arrays numpy

I have 3 different numpy arrays, but they all start with two columns containing the day of the year and the time (hour). For example:

   dyn = [[  83   12   7.10555687e-01 ...,   6.99242766e-01   6.868761e-01]
         [  83   13   8.28091972e-01 ...,   8.33734118e-01   8.47266838e-01]
         [  83   14   8.79437354e-01 ...,   8.73598144e-01   8.57156213e-01]
         [  161   23   3.28109488e-01 ...,   2.83043689e-01  2.59775391e-01]
         [  162   0    2.23502046e-01 ...,   1.96972086e-01  1.65565263e-01]
         [  162   1   2.51653976e-01 ...,   2.17209188e-01   1.42133495e-1]]

   us = [[  133   18   3.00483815e+02 ...,   1.94277561e+00   2.8168959e+00]
        [  133   19   2.98832620e+02 ...,   2.42506475e+00   2.99730800e+00]
        [  133   20   2.96706105e+02 ...,   3.16851622e+00   4.41187088e+00]
        [  161   23   2.88336560e+02 ...,   3.44864070e-01   3.85055635e-01]
        [  162   0    2.87593240e+02 ...,   2.93002410e-01   2.67112490e-01]
        [  162   2    2.86992180e+02 ...,   7.08996730e-02   2.6403210e-01]]

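I need to be able to remove any rows whose specific day and time are not present in all 3 arrays. In other words, I want to be left with 3 arrays whose first 2 columns are identical in each of the 3 arrays.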

So the resulting smaller arrays would be:

dyn= [[  161   23   3.28109488e-01 ...,   2.83043689e-01  2.59775391e-01]
     [  162   0    2.23502046e-01 ...,   1.96972086e-01  1.65565263e-01]]

us= [[  161   23   2.88336560e+02 ...,   3.44864070e-01   3.85055635e-01]
    [  162   0    2.87593240e+02 ...,   2.93002410e-01   2.67112490e-01]]

(but then also limited by what is in the third array)

I have tried using sort/zip, but I am not sure whether it can be applied to 2D arrays:

X= dyn
Y = us
xsorted=[x for (y,x) in sorted(zip(Y[:,1],X[:,1]), key=lambda pair: pair[0])]

I also tried a loop, but it only works when the same time/day sits at the same position in each array, which is not useful here:

for i in range(100):
     dyn_small=dyn[dyn[:,0]==us[i,0]]

2 Answers:

Answer 0: (score: 0)

Assuming A, B and C as the input arrays, here's a vectorized approach that makes heavy use of broadcasting -

import numpy as np

# Get masks comparing all rows of A with B and then B with C
M1 = (A[:,None,:2] == B[:,:2])
M2 = (B[:,None,:2] == C[:,:2])

# Get a joint 3D mask of those two masks and get the indices of matches.
# These indices (I,J,K) of the 3D mask basically tell us the row numbers
# corresponding to each of the input arrays that are present in all of them.
# Thus, in (I,J,K), I would be the matching row number in A, J in B & K in C.
I,J,K = np.where((M1[:,:,None,:] & M2).all(3))

# Finally, select rows of A, B and C with I, J and K respectively
A_new = A[I]
B_new = B[J]
C_new = C[K]

Sample run -

1) Inputs:

In [116]: A
Out[116]: 
array([[ 83,  12, 443],
       [ 83,  13, 565],
       [ 83,  14, 342],
       [161,  23, 431],
       [162,   0, 113],
       [162,   1, 313]])

In [117]: B
Out[117]: 
array([[161,  23, 999],
       [  5,   1,  13],
       [ 83,  12,  15],
       [162,   0,  12],
       [  4,   3,  11]])

In [118]: C
Out[118]: 
array([[ 11,  23, 143],
       [162,   0, 113],
       [161,  23, 545]])

2) Run the solution code to get the matching row IDs and thus extract the rows:

In [119]: M1 = (A[:,None,:2] == B[:,:2])
     ...: M2 = (B[:,None,:2] == C[:,:2])
     ...: 

In [120]: I,J,K = np.where((M1[:,:,None,:] & M2).all(3))

In [121]: A[I]
Out[121]: 
array([[161,  23, 431],
       [162,   0, 113]])

In [122]: B[J]
Out[122]: 
array([[161,  23, 999],
       [162,   0,  12]])

In [123]: C[K]
Out[123]: 
array([[161,  23, 545],
       [162,   0, 113]])
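To make the broadcasting steps more concrete, the sketch below rebuilds the sample inputs from the run above and prints the shapes of the intermediate masks. It is only an illustration, not part of the original answer; the name `joint` is introduced here for the combined mask. It also shows that the combined mask holds `len(A) * len(B) * len(C) * 2` booleans, so memory use grows quickly with the array sizes.

```python
import numpy as np

# Same sample inputs as in the run above.
A = np.array([[83, 12, 443], [83, 13, 565], [83, 14, 342],
              [161, 23, 431], [162, 0, 113], [162, 1, 313]])
B = np.array([[161, 23, 999], [5, 1, 13], [83, 12, 15],
              [162, 0, 12], [4, 3, 11]])
C = np.array([[11, 23, 143], [162, 0, 113], [161, 23, 545]])

# Pairwise comparisons of the first two columns (day, hour).
M1 = A[:, None, :2] == B[:, :2]   # every row of A against every row of B
M2 = B[:, None, :2] == C[:, :2]   # every row of B against every row of C

# Combined mask over (row of A, row of B, row of C, column).
# "joint" is a name chosen for this illustration.
joint = M1[:, :, None, :] & M2

print(M1.shape)     # (6, 5, 2)
print(M2.shape)     # (5, 3, 2)
print(joint.shape)  # (6, 5, 3, 2)
print(joint.size)   # 180

# A triple (i, j, k) matches only if both key columns agree.
I, J, K = np.where(joint.all(3))
print(I, J, K)      # [3 4] [0 3] [2 1]
```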

Answer 1: (score: 0)

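The numpy_indexed package (disclaimer: I am its author) contains functionality to solve this kind of problem in an elegant and efficient/vectorized manner: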

import numpy as np
import numpy_indexed as npi

dyn = np.array(dyn)
us = np.array(us)

dyn_index = npi.as_index(dyn[:, :2])
us_index = npi.as_index(us[:, :2])

common = npi.intersection(dyn_index, us_index)
print(common)
print(dyn[npi.contains(common, dyn_index)])
print(us[npi.contains(common, us_index)])

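Note that performance is N log N in the worst case, and linear insofar as the arguments to as_index are already in sorted order. By contrast, the currently accepted answer is quadratic in input size.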
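For readers who do not have numpy_indexed installed, a similar sort-based idea can be sketched in plain NumPy. This is not the package's implementation, only a rough equivalent under stated assumptions: the helper name `keep_common_rows` is hypothetical, the first column is taken to be an integer day and the second an integer hour in 0..23 (so `day * 24 + hour` is a unique key), and each (day, hour) pair is assumed to occur at most once per array.

```python
from functools import reduce

import numpy as np


def keep_common_rows(*arrays):
    """Keep only rows whose (day, hour) key occurs in every input array.

    Hypothetical helper for illustration; assumes column 0 is an integer
    day and column 1 an integer hour in 0..23.
    """
    # Encode each (day, hour) pair as a single integer key.
    keys = [(a[:, 0] * 24 + a[:, 1]).astype(np.int64) for a in arrays]
    # Sorted keys that are present in all arrays.
    common = reduce(np.intersect1d, keys)
    result = []
    for a, k in zip(arrays, keys):
        mask = np.isin(k, common)
        # Sort the kept rows by key so the outputs line up row by row.
        order = np.argsort(k[mask], kind="stable")
        result.append(a[mask][order])
    return result


# Sample inputs taken from the first answer.
A = np.array([[83, 12, 443], [83, 13, 565], [83, 14, 342],
              [161, 23, 431], [162, 0, 113], [162, 1, 313]])
B = np.array([[161, 23, 999], [5, 1, 13], [83, 12, 15],
              [162, 0, 12], [4, 3, 11]])
C = np.array([[11, 23, 143], [162, 0, 113], [161, 23, 545]])

A_new, B_new, C_new = keep_common_rows(A, B, C)
print(A_new)
print(B_new)
print(C_new)
```

Because `np.intersect1d` and `np.argsort` are sort-based, this avoids building the large pairwise masks of the broadcasting approach. If the hour column can take other values, the key encoding would need to be adjusted accordingly.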