Is there a better way to determine cross-mapping indices for numpy arrays?

Date: 2015-11-13 17:32:02

Tags: python arrays numpy vectorization

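I need the cross-mapped indices for the numpy union and intersection operations. My code below works, but I would like to vectorize it before applying it to large datasets. Or, if there is a better built-in way, what is it?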

import numpy as np

# ------- define the arrays and set operations ---------
A = np.array(['a','b','c','e','f','g','h','j'])
B = np.array(['h','i','j','k','m'])
C = np.union1d(A, B)
D = np.intersect1d(A,B)

# ------- get the mapped indices for the union ----
zc = np.empty((len(C),3,))
zc[:]=np.nan
zc[:,0] = range(0,len(C))
for iy in range(0,len(C)):
    for ix in range(0, len(A)):
        if A[ix] == C[iy]:
            zc[iy,1] = ix
    for ix in range(0, len(B)):
        if B[ix] == C[iy]:
            zc[iy,2] = ix

# ------- get the mapped indices for the intersection ----
zd = np.empty((len(D),3,))
zd[:]=np.nan
zd[:,0] = range(0,len(D))
for iy in range(0,len(D)):
    for ix in range(0, len(A)):
        if A[ix] == D[iy]:
            zd[iy,1] = ix
    for ix in range(0, len(B)):
        if B[ix] == D[iy]:
            zd[iy,2] = ix

1 answer:

Answer 0: (score: 2)

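For a case like this, you may want to convert the strings to numeric IDs, since those are more efficient to work with. Also, given that the output is a numeric array, it makes sense to have numeric IDs up front. To convert to numeric IDs, I have seen people use lambda and similar approaches, but I would use np.unique, which is quite efficient for cases like this. The implementation below starts with the numeric-ID conversion.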
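As a small illustration of the np.unique step, the sketch below shows what `return_inverse=True` produces. The `labels` array is a made-up example, not data from the question.

```python
import numpy as np

# Hypothetical sample labels, chosen only for illustration.
labels = np.array(['pear', 'apple', 'fig', 'apple'])

# uniques holds the sorted distinct values;
# ids[i] is the position of labels[i] within uniques.
uniques, ids = np.unique(labels, return_inverse=True)

print(uniques)  # the sorted distinct labels: apple, fig, pear
print(ids)      # the integer ID of each label: 2, 0, 1, 0

# Indexing uniques with the IDs reconstructs the original array.
assert (uniques[ids] == labels).all()
```

The implementation that follows applies the same call to the concatenation of A and B, so that both arrays share one ID space.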

# ------------------------ Setup work -------------------------------
_,idx1 = np.unique(np.append(A,B),return_inverse=True)
A_ID = idx1[:A.size]
B_ID = idx1[A.size:]

# ------------------------ Union work -------------------------------
# Get length of zc, which would be the max of ID+1.
lenC = idx1.max()+1

# Initialize output array zc and fill with NaNs.
zc1 = np.empty((lenC,3,))
zc1[:]=np.nan

# Fill first column with consecutive numbers starting with 0
zc1[:,0] = range(0,lenC)

# Most important part of the code :
# Set the cols-1,2 at places specified by IDs from A and B respectively
# with values from 0 to the extent of the respective IDs
zc1[A_ID,1] = np.arange(A_ID.size)
zc1[B_ID,2] = np.arange(B_ID.size)

# ------------------------ Intersection work -------------------------------
# Get intersecting indices between A and B
intersect_ID = np.argwhere(A_ID[:,None] == B_ID)

# Initialize output zd based on the number of intersections
lenD = intersect_ID.shape[0]
zd1 = np.empty((lenD,3,))
zd1[:] = np.nan

# Fill first column with consecutive numbers starting with 0
zd1[:,0] = range(0,lenD)
zd1[:,1:] = intersect_ID
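On the "built-in" part of the question, one possible alternative is sketched below. It assumes NumPy 1.15 or later, where `np.intersect1d` accepts `return_indices=True`, and it assumes each input array contains no duplicate values. It also avoids the pairwise comparison matrix of shape (len(A), len(B)) that `A_ID[:,None] == B_ID` builds. The names `zc2` and `zd2` are introduced here only for illustration.

```python
import numpy as np

A = np.array(['a', 'b', 'c', 'e', 'f', 'g', 'h', 'j'])
B = np.array(['h', 'i', 'j', 'k', 'm'])

# Intersection: return_indices=True (NumPy >= 1.15) also returns the
# position of each common value in A and in B.
D, ia, ib = np.intersect1d(A, B, return_indices=True)
zd2 = np.column_stack((np.arange(D.size), ia, ib))

# Union: C is sorted, so searchsorted finds where each element of A and B
# sits inside C. Scatter the source positions into NaN-initialised columns.
C = np.union1d(A, B)
zc2 = np.full((C.size, 3), np.nan)
zc2[:, 0] = np.arange(C.size)
zc2[np.searchsorted(C, A), 1] = np.arange(A.size)
zc2[np.searchsorted(C, B), 2] = np.arange(B.size)

print(zd2)
print(zc2)
```

Note that `zd2` comes out as an integer array, since the intersection never needs NaN placeholders, while `zc2` stays floating point so that missing positions can be marked with NaN.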