I'm trying to significantly speed up the following code, but to no avail. The code takes a 2D array and deletes rows that are too similar to the other rows in the array. Please see the code and comments below.
as0 = a.shape[0]
for i in range(as0):
    a2s0 = a.shape[0]  # shape may change after each iteration
    if i > (a2s0 - 1):
        break
    # takes the difference between all rows in array by iterating over each
    # row. Then sums the absolutes. The condition finally gives a boolean
    # array output - similarity condition of 0.01
    t = np.sum(np.absolute(a[i, :] - a), axis=1) < 0.01
    # Retains the indices that are too similar and then deletes the
    # necessary rows
    inddel = np.where(t)[0]
    inddel = [k for k in inddel if k != i]
    a = np.delete(a, inddel, 0)
I was wondering whether vectorization is possible here, but I'm not very familiar with it. Any help is greatly appreciated.
EDIT:
    if i >= (a2s0 - 1):  # Added equality sign
        break
    # Now this only calculates over rows that have not been compared.
    t = np.sum(np.absolute(a[i, :] - a[np.arange(i + 1, a2s0), :]), axis=1) > 0.01
    t = np.concatenate((np.ones(i + 1, dtype=bool), t))
    a = a[t, :]
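For reference, the edited loop can be wrapped into a runnable function. This is only a sketch of the logic above (the function name and the sample data are mine): keep a row only if its L1 distance to every earlier surviving row exceeds the threshold.

```python
import numpy as np

def dedup_loop(a, thresh=0.01):
    # Sketch of the edited loop above: compare row i only against the
    # rows after it, and drop the later member of each too-close pair.
    i = 0
    while i < a.shape[0] - 1:
        n = a.shape[0]
        keep = np.sum(np.absolute(a[i, :] - a[i + 1:n, :]), axis=1) > thresh
        keep = np.concatenate((np.ones(i + 1, dtype=bool), keep))
        a = a[keep, :]  # shape shrinks, so the loop bound is re-checked
        i += 1
    return a

a = np.array([[0.0, 0.0], [0.0, 0.005], [1.0, 1.0]])
print(dedup_loop(a))  # the row within 0.01 of row 0 is dropped
```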
Answer 0 (Score: 4)
Approach #1: Broadcasting
Here's a vectorized approach that uses broadcasting after extending a to 3D, then performs those computations across all iterations in a vectorized manner -
mask = (np.absolute(a[:,None] - a)).sum(2) < 0.01
a_out = a[~np.triu(mask,1).any(0)]
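A minimal, self-contained demo of these two lines (the example array and threshold are mine):

```python
import numpy as np

# a[:, None] - a broadcasts to shape (n, n, m); summing absolute
# differences over axis 2 yields the full n x n L1-distance matrix
# in one vectorized step.
a = np.array([[0.0, 0.0],
              [0.0, 0.005],   # within 0.01 of row 0
              [1.0, 1.0]])
mask = np.absolute(a[:, None] - a).sum(2) < 0.01
# np.triu(mask, 1) keeps only pairs (i, j) with j > i, so .any(0)
# flags the *later* row of each close pair; the earlier one survives.
a_out = a[~np.triu(mask, 1).any(0)]
print(a_out)  # only rows 0 and 2 survive
```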
Approach #2: Using pdist('cityblock')
For large arrays, we might run into memory issues with the previous approach. So, as an alternative, we can use pdist with 'cityblock', which computes the Manhattan distances, and then identify the corresponding row/col in the squared-distance-matrix form without actually computing that form, instead using searchsorted to solve it efficiently.
Here's the implementation -
from scipy.spatial.distance import pdist

thresh = 0.01  # similarity threshold
n = a.shape[0]
dists = pdist(a, 'cityblock')
idx = np.flatnonzero(dists < thresh)
sep_idx = np.arange(n - 1, 0, -1).cumsum()
rm_idx = np.unique(np.searchsorted(sep_idx, idx, 'right'))
a_out = np.delete(a, rm_idx, axis=0)
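The index bookkeeping is the subtle part here: pdist returns the upper triangle of the distance matrix flattened row by row, so row i owns a contiguous block of n-1-i condensed entries, and searchsorted against the cumulative block boundaries recovers i. A small demo of that mapping (the values are mine):

```python
import numpy as np

# For n = 4, pdist flattens the upper triangle row by row as
# (0,1),(0,2),(0,3),(1,2),(1,3),(2,3): row 0 owns condensed indices
# 0-2, row 1 owns 3-4, row 2 owns 5.
n = 4
sep_idx = np.arange(n - 1, 0, -1).cumsum()           # [3, 5, 6]: block ends
print(np.searchsorted(sep_idx, [0, 1, 2], 'right'))  # [0 0 0] -> row 0
print(np.searchsorted(sep_idx, [3, 4], 'right'))     # [1 1]   -> row 1
print(np.searchsorted(sep_idx, [5], 'right'))        # [2]     -> row 2
# So rm_idx collects the earlier row i of every pair (i, j) whose
# distance falls below thresh.
```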
Approaches -
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import manhattan_distances

# Approach #2 from this post
def remove_similar_rows(a, thresh=0.01):
    n = a.shape[0]
    dists = pdist(a, 'cityblock')
    idx = np.flatnonzero(dists < thresh)
    sep_idx = np.arange(n - 1, 0, -1).cumsum()
    rm_idx = np.unique(np.searchsorted(sep_idx, idx, 'right'))
    return np.delete(a, rm_idx, axis=0)

# @John Zwinck's soln
def pairwise_manhattan_distances(a, thresh=0.01):
    d = manhattan_distances(a)
    # k=-1 excludes the zero diagonal; without it every row matches itself
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]
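As a sanity check (mine, not part of the answer), both routes should drop the same rows, namely the earlier member of each too-close pair. In the sketch below, scipy's cdist with 'cityblock' stands in for sklearn's manhattan_distances so the check only needs scipy; the planted duplicate and seed are mine.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def remove_similar_rows(a, thresh=0.01):
    # condensed-distance route (Approach #2 above)
    n = a.shape[0]
    dists = pdist(a, 'cityblock')
    idx = np.flatnonzero(dists < thresh)
    sep_idx = np.arange(n - 1, 0, -1).cumsum()
    rm_idx = np.unique(np.searchsorted(sep_idx, idx, 'right'))
    return np.delete(a, rm_idx, axis=0)

def remove_similar_rows_tril(a, thresh=0.01):
    # full-matrix route; cdist replaces sklearn's manhattan_distances
    d = cdist(a, a, 'cityblock')
    # k=-1 excludes the zero diagonal
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]

np.random.seed(1)
a = np.random.randint(0, 9, (200, 30)).astype(float)
a[50] = a[10]                      # plant one exact duplicate
out1 = remove_similar_rows(a)
out2 = remove_similar_rows_tril(a)
print(out1.shape)                  # one row removed
```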
Timings -
In [209]: a = np.random.randint(0,9,(4000,30))
# Let's set 100 rows randomly as dups
In [210]: idx0 = np.random.choice(4000,size=100, replace=0)
In [211]: idx1 = np.random.choice(4000,size=100, replace=0)
In [217]: a[idx0] = a[idx1]
In [238]: %timeit pairwise_manhattan_distances(a, thresh=0.01)
1 loops, best of 3: 225 ms per loop
In [239]: %timeit remove_similar_rows(a, thresh=0.01)
10 loops, best of 3: 100 ms per loop
Answer 1 (Score: 4)
Let's create some fake data:
np.random.seed(0)
a = np.random.random((4,3))
Now we have:
array([[ 0.5488135 , 0.71518937, 0.60276338],
[ 0.54488318, 0.4236548 , 0.64589411],
[ 0.43758721, 0.891773 , 0.96366276],
[ 0.38344152, 0.79172504, 0.52889492]])
Next, we want the sum of the element-wise differences for all pairs of rows. We can use the Manhattan Distance:
d = sklearn.metrics.pairwise.manhattan_distances(a)
Which gives:
array([[ 0. , 0.33859562, 0.64870931, 0.31577611],
[ 0.33859562, 0. , 0.89318282, 0.6465111 ],
[ 0.64870931, 0.89318282, 0. , 0.5889615 ],
[ 0.31577611, 0.6465111 , 0.5889615 , 0. ]])
Now you can apply a threshold, keeping only one triangle:
m = np.tril(d < 0.4, -1) # large threshold just for this example
To get a boolean mask:
array([[False, False, False, False],
[ True, False, False, False],
[False, False, False, False],
[ True, False, False, False]], dtype=bool)
This tells you that row 0 is "too similar" to rows 1 and 3. Now you can remove rows from the original matrix where any element of the mask is True:
a[~np.any(m, axis=0)] # axis can be either 0 or 1 - design choice
This gives you:
array([[ 0.54488318, 0.4236548 , 0.64589411],
[ 0.43758721, 0.891773 , 0.96366276],
[ 0.38344152, 0.79172504, 0.52889492]])
Putting it all together:
d = sklearn.metrics.pairwise.manhattan_distances(a)
a = a[~np.any(np.tril(d < 0.4, -1), axis=0)]
Answer 2 (Score: 2)
Your first line:
t = np.sum(np.absolute(a[i,:] - a), axis=1)<0.01
takes the sum of the absolute value of the difference between one row and the whole array, every time. That's probably not what you need; instead, try taking the difference between the current row and only the rows that come later in the array. You have already compared all earlier rows against the current one, so why do it again?
Also, deleting rows from an array is an expensive, slow operation, so you will probably be faster if you check all the rows first and then delete the near-duplicates in one go. You can also skip checking any rows that are already scheduled for deletion, since you know they will be removed.
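A sketch of this advice, assuming the same 0.01 L1 threshold as the question (the function name and sample data are mine): mark near-duplicates in one pass, skip rows already marked, and index once at the end instead of calling np.delete repeatedly.

```python
import numpy as np

def dedup_single_pass(a, thresh=0.01):
    keep = np.ones(a.shape[0], dtype=bool)
    for i in range(a.shape[0]):
        if not keep[i]:
            continue                  # already scheduled for removal
        later = np.arange(i + 1, a.shape[0])
        later = later[keep[later]]    # skip rows already marked
        close = np.sum(np.absolute(a[i] - a[later]), axis=1) < thresh
        keep[later[close]] = False
    return a[keep]                    # one fancy index, no np.delete calls

a = np.array([[0.0, 0.0], [0.0, 0.005], [1.0, 1.0], [0.001, 0.0]])
print(dedup_single_pass(a))  # rows 1 and 3 collapse into row 0
```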