In my project (on clustering algorithms, specifically k-medoids), being able to compute pairwise distances efficiently is essential. I have a dataset of about 60,000 objects. The catch is that distances must be computed between non-uniform vectors, i.e. vectors that may differ in length (in that case, missing entries are treated as 0).
Here is a minimal working example:
# %%
MAX_LEN = 11
N = 100

import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()
import numpy as np

def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

def compute_distances_np():
    n = len(data_np)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = [np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data]

%timeit compute_distances_np()
I am benchmarking my pure-Python list implementation against the numpy implementation.
Here are the results (computation times):
Python lists: 79.6 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
numpy arrays: 226 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Why is there such a large difference? I thought numpy arrays were supposed to be very fast.
Is there a way to improve my code? Am I misunderstanding how numpy works internally?
Edit: In the future I may need to run the pairwise distance computation with a custom distance function. The method should also work on a dataset of 60,000 objects without running out of memory.
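To make the memory concern concrete, a quick back-of-the-envelope calculation (assuming the full distance matrix is stored as float64):

```python
# A full pairwise distance matrix for n objects, stored as float64,
# takes n * n * 8 bytes.
n = 60_000
bytes_needed = n * n * 8
print(bytes_needed / 1e9)  # ~28.8 GB, too large to hold in typical RAM
```

This is why a purely computational approach (or batching) matters at that scale.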
Answer 0 (score: 1)
I believe you can make your arrays dense by padding the unused trailing elements with 0.
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

def batch_pdist(x, metric, batchsize=1000):
    dists = np.zeros((len(x), len(x)))
    for i in range(0, len(x), batchsize):
        for j in range(0, len(x), batchsize):
            dist_batch = cdist(x[i:i+batchsize], x[j:j+batchsize], metric=metric)
            dists[i:i+batchsize, j:j+batchsize] = dist_batch
    return dists

MIN_LEN = 5
MAX_LEN = 11
N = 10000
M = 10

data = np.zeros((N, MAX_LEN))
for i in range(N):
    num_nonzero = np.random.randint(MIN_LEN, MAX_LEN)
    data[i, :num_nonzero] = np.random.randint(0, M, num_nonzero)

dists = squareform(pdist(data, metric='cityblock'))
dists2 = batch_pdist(data, metric='cityblock', batchsize=500)
print((dists == dists2).all())
Timing output:
%timeit squareform(pdist(data, metric='cityblock'))
43.8 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Edit: for a custom distance function, see the very bottom of this documentation.
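For reference, `pdist` and `cdist` also accept a plain Python callable as the metric. A minimal sketch (note that a Python callable gives up the C-level speed of the built-in metrics, so this trades speed for flexibility):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Custom metric: a callable taking two 1-D arrays and returning a scalar.
def manhattan(u, v):
    return np.abs(u - v).sum()

X = np.array([[0.0, 1.0, 2.0],
              [3.0, 0.0, 0.0]])
D = squareform(pdist(X, metric=manhattan))
print(D)  # D[0, 1] == |0-3| + |1-0| + |2-0| == 6
```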
Answer 1 (score: 0)
I finally found the most straightforward solution to this problem: it requires few code changes and relies on computation rather than memory (which may be infeasible for very large datasets).
Following juanpa.arrivillaga's suggestion, I tried numba, a library that speeds up array-oriented and math-heavy Python code, targeting numpy in particular. An excellent guide on optimizing Python code is available here: https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/.
MAX_LEN = 11
N = 100

# Pure Python lists implementation.
import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

# numpy+numba implementation.
import numpy as np
from numba import jit

@jit
def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

@jit
def compute_distances_np():
    n = len(data_np)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = np.array([np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data])

%timeit compute_distances_np()
Timing output:
%timeit compute_distances()
78.4 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit compute_distances_np()
4.1 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
As you can see, the numpy+numba-optimized version is roughly 19x faster (with no other code optimizations involved).