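I need to compute the difference (a score) between every row of a 2D array and every other row of that array. The scores are needed to compute a "difference distance" for the array, which is then used for statistics. Here is a simple example, but I need to do this on roughly 20,000 arrays, each with thousands to ~100,000 rows, so I am looking for a way to speed up my naive code: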
import numpy

a = numpy.array([[1,2],[1,2],[1,3],[2,3],[3,3]])
score = 0
scoresquare = 0
for i in range(len(a)):
    for j in range(i+1, len(a)):
        scoretemp = 0
        if a[i,0]!=a[j,0] and a[i,1]!=a[j,0] and a[i,1]!=a[j,1] and a[i,0]!=a[j,1]:
            # no element of row i appears in row j: two differences
            scoretemp = 2
        elif (a[i]==a[j]).all():
            # identical rows
            scoretemp = 0
        else:
            # rows share at least one element but are not identical
            scoretemp = 1
        print(a[i], a[j], scoretemp, (a[i]==a[j]).all(), (a[i]==a[j]).any())
        score += scoretemp
        scoresquare += scoretemp*scoretemp
print(score, scoresquare)
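a[0] is identical to a[1], so the score (number of differences) is 0; it has one difference from a[2] and from a[3], and two differences from a[4]. To compute such a distance (a statistic), I need both the sum of the scores and the sum of the squared scores.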
reference_row compared_row score
[1 2] [1 2] 0
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 3] [2 3] 1
[1 3] [3 3] 1
[2 3] [3 3] 1
Sum_score=11 Sum_scoresquare=15
My code is quite naive and does not take full advantage of the data, so: how can I speed up a computation like this? Thanks for your help.
Answer 0 (score: 1)
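np.in1d searches array2 for each element of array1 and yields True for every match, so we negate the result with ~np.in1d to mark the elements that do not match. np.where then returns the indices of the True values, so len(np.where(...)[0]) gives the number of mismatches for a pair of rows. I hope this helps: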
>>> import numpy as np
>>> a = np.array([[1,2],[1,2],[1,3],[2,3],[3,3]])
>>> res=[len(np.where(~np.in1d(a[p],a[q]))[0]) for p in range(a.shape[0]) for q in range(p+1,a.shape[0])]
>>> res=np.array(res)
>>> Sum_score=sum(res)
>>> Sum_score_square=sum(res*res)
>>> print(Sum_score, Sum_score_square)
11 15
>>> k = 0
>>> for i in range(a.shape[0]):
...     for j in range(i+1, a.shape[0]):
...         print(a[i], a[j], res[k])
...         k += 1
...
[1 2] [1 2] 0
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 2] [1 3] 1
[1 2] [2 3] 1
[1 2] [3 3] 2
[1 3] [2 3] 1
[1 3] [3 3] 1
[2 3] [3 3] 1
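
The list comprehension above still loops over every pair in Python. As a rough sketch of how the Python-level loop could be removed, the snippet below scores all pairs at once with NumPy indexing. It is an illustration, not a tested solution for the full data: the function name pairwise_scores is made up for this example, it assumes two-column rows as in the question, and it follows the question's if/elif/else rule rather than the np.in1d count. The two can disagree on some pairs: for [1 2] against [2 1], np.in1d counts 0 mismatches while the question's rule gives 1.

```python
import numpy as np

def pairwise_scores(a):
    """Score every row pair (i < j) using the rule from the question's loop.

    Hypothetical helper for illustration; assumes `a` has exactly two columns.
    """
    a = np.asarray(a)
    i, j = np.triu_indices(len(a), k=1)   # all index pairs with i < j
    x, y = a[i], a[j]                      # each has shape (n_pairs, 2)

    # Score 2: no element of row i equals any element of row j.
    disjoint = ((x[:, 0] != y[:, 0]) & (x[:, 1] != y[:, 0])
                & (x[:, 1] != y[:, 1]) & (x[:, 0] != y[:, 1]))
    # Score 0: the two rows are identical.
    identical = (x == y).all(axis=1)

    scores = np.ones(len(i), dtype=np.int64)  # default score is 1
    scores[disjoint] = 2
    scores[identical] = 0
    return scores

a = np.array([[1, 2], [1, 2], [1, 3], [2, 3], [3, 3]])
scores = pairwise_scores(a)
print(scores.sum(), (scores ** 2).sum())  # prints: 11 15
```

Note that this builds all n(n-1)/2 pairs in memory. For ~100,000 rows that is about 5·10^9 pairs, so in practice the pairs would probably have to be processed in chunks, or duplicate rows collapsed first.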