计算非均匀向量的距离:python列表与numpy数组

时间:2017-10-02 15:56:00

标签: python arrays numpy distance

在我的项目中(关于聚类算法,特别是k-medoids)对于能够有效地计算成对距离至关重要。我有一个约60,000个对象的数据集。问题是,必须在非均匀向量之间计算距离,即长度可能不同的向量(在这种情况下,缺失的项目被视为0)。

这是一个最小的工作示例:

# %%
MAX_LEN = 11
N = 100

import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

import numpy as np

def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = [np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data]

%timeit compute_distances_np()

我正在测试我的Python列表实现与numpy实现。 以下是结果(计算时间):

  1. Python列表:79.6 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
  2. numpy数组:226 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. 为什么会有这么大的差异?我认为numpy数组非常快。 有没有办法改善我的代码?我误解了numpy的内部运作吗?

    编辑:将来我可能需要使用自定义距离函数进行成对距离计算。该方法也适用于长度为60'000的数据集,而不会耗尽内存。

2 个答案:

答案 0 :(得分:1)

我相信你可以让你的数组密集,并将未使用的最后一个元素设置为0。

import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform


def batch_pdist(x, metric, batchsize=1000):
    dists = np.zeros((len(x), len(x)))
    for i in range(0, len(x), batchsize):
        for j in range(0, len(x), batchsize):
            dist_batch = cdist(x[i:i+batchsize], x[j:j+batchsize], metric=metric)
            dists[i:i+batchsize, j:j+batchsize] = dist_batch

    return dists


MIN_LEN = 5
MAX_LEN = 11
N = 10000
M = 10

data = []
data = np.zeros((N,MAX_LEN))
for i in range(N):
    num_nonzero = np.random.randint(MIN_LEN, MAX_LEN)
    data[i, :num_nonzero] = np.random.randint(0, M, num_nonzero)

dists = squareform(pdist(data, metric='cityblock'))
dists2 = batch_pdist(data, metric='cityblock', batchsize=500)
print((dists == dists2).all())

计时输出:

%timeit squareform(pdist(data, metric='cityblock'))
43.8 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

编辑:

对于自定义距离函数,请参阅此documentation的最底部。

答案 1 :(得分:0)

我终于找到了解决这个问题的最直接的方法,而不需要改变太多的代码,只依赖于计算而不依赖于内存(因为对于非常大的数据集来说这可能是不可行的)。

基于juanpa.arrivillaga的建议,我尝试了numba,这是一个加速面向数组和数学的Python代码的库,主要针对numpy。您可以在此处阅读有关优化Python代码的优秀指南:https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/

MAX_LEN = 11
N = 100

# Pure Python lists implementation.
import random

def manhattan_distance(vec1, vec2):
    n1, n2 = len(vec1), len(vec2)
    n = min(n1, n2)
    dist = 0
    for i in range(n):
        dist += abs(vec1[i] - vec2[i])
    if n1 > n2:
        for i in range(n, n1):
            dist += abs(vec1[i])
    else:
        for i in range(n, n2):
            dist += abs(vec2[i])
    return dist

def compute_distances():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance(data[i], data[j])

data = []
for i in range(N):
    data.append([])
    for k in range(random.randint(5, MAX_LEN)):
        data[i].append(random.randint(0, 10))

%timeit compute_distances()

# numpy+numba implementation.
import numpy as np
from numba import jit

@jit
def manhattan_distance_np(vec1, vec2):
    return np.absolute(vec1 - vec2).sum()

@jit
def compute_distances_np():
    n = len(data)
    for i in range(n):
        for j in range(n):
            manhattan_distance_np(data_np[i], data_np[j])

data_np = np.array([np.append(np.asarray(d), np.zeros(MAX_LEN - len(d))) for d in data])

%timeit compute_distances_np()

计时输出:

%timeit compute_distances()
78.4 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit compute_distances_np()
4.1 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

正如您所看到的,numpy numba优化的速度提高了约19倍(不涉及其他代码优化)。