Question

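I need to compute the number of differences (a score) between every row of a 2D array and every other row of that array. The score is needed to compute a "distance of differences" statistic for the array. Here is a simple example, but I need to do this on roughly 20,000 arrays, with anywhere from thousands to ~100,000 rows each, so I am looking for a way to speed up my naive code: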

import numpy

a = numpy.array([[1,2],[1,2],[1,3],[2,3],[3,3]])
score = 0
scoresquare = 0
for i in range(len(a)):
    for j in range(i+1, len(a)):
        scoretemp = 0
        if a[i,0]!=a[j,0] and a[i,1]!=a[j,0] and a[i,1]!=a[j,1] and a[i,0]!=a[j,1]:
            # the two rows share no item at all
            scoretemp = 2
        elif (a[i]==a[j]).all():
            # the two rows are identical
            scoretemp = 0
        else:
            # the rows share at least one item but are not identical
            scoretemp = 1
        print(a[i], a[j], scoretemp, (a[i]==a[j]).all(), (a[i]==a[j]).any())
        score += scoretemp
        scoresquare += scoretemp*scoretemp
print(score, scoresquare)

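a[0] is identical to a[1], so the score (number of differences) is 0, but it has one difference from a[2] and two differences from a[4]. To compute this kind of distance (statistic), I need both the sum of the scores and the intermediate sum of the squared scores.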

reference_row  compared_row  score
[1 2]          [1 2]         0  
[1 2]          [1 3]         1 
[1 2]          [2 3]         1 
[1 2]          [3 3]         2  
[1 2]          [1 3]         1 
[1 2]          [2 3]         1  
[1 2]          [3 3]         2  
[1 3]          [2 3]         1  
[1 3]          [3 3]         1  
[2 3]          [3 3]         1  
Sum_score=11 Sum_scoresquare=15

My code is quite naive and does not take full advantage of the data, so: how can I speed up this kind of computation? Thanks for your help.

Answer 1

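np.in1d looks up each element of array1 in array2 and returns True where there is a match, so we negate the result with ~np.in1d. np.where then returns a tuple holding the indices of the True values, so len(np.where(...)[0]) gives the total number of mismatches. I hope this helps.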
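To make the mask concrete, here is a minimal sketch for a single pair of rows. The names row_p and row_q are only illustrative. It uses np.isin rather than np.in1d: for 1-D inputs like these the two return the same mask, and NumPy's documentation recommends isin for new code.

```python
import numpy as np

# Two example rows taken from the question's array (names are illustrative).
row_p = np.array([1, 2])
row_q = np.array([2, 3])

# True where an element of row_p also occurs somewhere in row_q.
in_q = np.isin(row_p, row_q)

# Negate the mask: True where an element of row_p is missing from row_q.
not_in_q = ~in_q

# np.where returns a tuple of index arrays; [0] selects the only axis.
mismatch_idx = np.where(not_in_q)[0]
n_mismatch = len(mismatch_idx)

print(in_q, not_in_q, n_mismatch)
```

Without the [0], len() would measure the tuple itself rather than the number of mismatching positions, which is why the index is needed.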

>>> import numpy as np
>>> a = np.array([[1,2],[1,2],[1,3],[2,3],[3,3]])
>>> # number of elements of row p that do not occur in row q, for every pair p < q
>>> res = [len(np.where(~np.in1d(a[p], a[q]))[0]) for p in range(a.shape[0]) for q in range(p+1, a.shape[0])]
>>> res = np.array(res)
>>> Sum_score = sum(res)
>>> Sum_score_square = sum(res*res)
>>> print(Sum_score, Sum_score_square)
11 15
>>> k = 0
>>> for i in range(a.shape[0]):
...     for j in range(i+1, a.shape[0]):
...         print(a[i], a[j], res[k])
...         k += 1
...
[1 2] [1 2] 0
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 3] [2 3] 1
[1 3] [3 3] 1
[2 3] [3 3] 1
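The list comprehension above still runs a Python-level loop over every pair of rows. As a possible alternative, here is a sketch that applies the scoring rule from the question's loop (2 if the rows share no item, 0 if they are identical, 1 otherwise) with NumPy broadcasting. The function name pairwise_score_sums and the chunk parameter are my own, and the code assumes exactly two columns, as in the example. Because every score is 0, 1 or 2, it is enough to count the identical pairs and the disjoint pairs.

```python
import numpy as np

def pairwise_score_sums(a, chunk=256):
    """Return (sum of scores, sum of squared scores) over all row pairs i < j.

    Hypothetical helper; assumes `a` has exactly two columns.
    """
    a = np.asarray(a)
    n = len(a)
    col0, col1 = a[:, 0], a[:, 1]
    idx = np.arange(n)
    n_same = 0      # pairs scoring 0
    n_disjoint = 0  # pairs scoring 2
    # Process `chunk` reference rows at a time so that the temporary
    # boolean arrays have shape (chunk, n) instead of (n, n).
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        x0 = col0[start:stop, None]
        x1 = col1[start:stop, None]
        # Keep only pairs with j > i so each pair is counted once.
        upper = idx[None, :] > idx[start:stop, None]
        same = (x0 == col0) & (x1 == col1)
        disjoint = (x0 != col0) & (x0 != col1) & (x1 != col0) & (x1 != col1)
        n_same += int(np.count_nonzero(same & upper))
        n_disjoint += int(np.count_nonzero(disjoint & upper))
    n_pairs = n * (n - 1) // 2
    n_one = n_pairs - n_same - n_disjoint  # pairs scoring 1
    return n_one + 2 * n_disjoint, n_one + 4 * n_disjoint

a = np.array([[1, 2], [1, 2], [1, 3], [2, 3], [3, 3]])
total, total_sq = pairwise_score_sums(a)
print(total, total_sq)  # 11 15
```

One thing to be aware of: the in1d count and the question's loop agree on this example but not in general. For the rows [1 2] and [2 1] the question's loop scores 1, while the in1d count gives 0 because every element of the first row occurs in the second. The sketch follows the question's rule. For arrays with around 100,000 rows the number of pairs is still quadratic, so this only removes the Python-level inner loop; it does not reduce the amount of comparison work.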
