Efficient proximity distance matrix in Python

Time: 2021-01-18 16:05:22

Tags: python scipy linear-algebra distance

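I need a memory- and time-efficient way to compute the distances between roughly 50000 points in 1 to 10 dimensions, in Python. Nothing I have tried so far works well. My attempts:

  • scipy.spatial.distance.pdist, which computes the full distance matrix
  • scipy.spatial.KDTree.sparse_distance_matrix, which computes a sparse distance matrix up to a threshold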

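To my surprise, sparse_distance_matrix performed very badly. My test case was 5000 points drawn uniformly from the unit 5-dimensional ball. pdist returned its result in 0.113 seconds, while sparse_distance_matrix took 44.966 seconds with a maximum-distance cutoff of 0.1.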

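At this point I would simply stick with pdist, but with 50000 points it would need a numpy array of 2.5 x 10^9 entries, and I am worried that this would exhaust the available memory at runtime.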
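To put a number on the memory concern, here is a back-of-the-envelope sketch. It assumes float64 output (8 bytes per entry) and uses the fact that pdist returns a condensed vector of N(N-1)/2 entries, which is about half of the N x N count; the 2.5 x 10^9 figure corresponds to the full square matrix. The variable names are illustrative only.

```python
# Rough memory estimate for pairwise distances on N points.
# Assumption: 8 bytes per entry (float64).
N = 50_000
bytes_per_entry = 8

condensed_entries = N * (N - 1) // 2   # length of the vector pdist returns
square_entries = N * N                 # full square matrix (e.g. after squareform)

condensed_gib = condensed_entries * bytes_per_entry / 2**30
square_gib = square_entries * bytes_per_entry / 2**30

print(f"condensed: {condensed_entries:,} entries, about {condensed_gib:.1f} GiB")
print(f"square:    {square_entries:,} entries, about {square_gib:.1f} GiB")
```

Either way the dense result runs to several gibibytes, which is why a cutoff-based sparse representation is attractive at this size.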

Does anyone know a better approach, or see an obvious mistake in my implementation? Thanks in advance!


Here is what is needed to reproduce the output in Python 3:

import numpy as np
import math
import time
from scipy.spatial.distance import pdist
from scipy.spatial import KDTree as kdtree

# Generate a uniform sample of size N on the unit dim-dimensional sphere (which lives in dim+1 dimensions)
def sphere(N, dim):
    # Get a random sample of points from the (dim+1)-dim. Gaussian.
    output = np.random.multivariate_normal(mean=np.zeros(dim+1), cov=np.identity(dim+1), size=N)
    # Normalize output
    output = output / np.linalg.norm(output, axis=1).reshape(-1,1)
    return output

# Generate a uniform sample of size N on the unit dim-dimensional ball.
def ball(N, dim):
    # Populate the points on the unit sphere that is the boundary.
    sphere_points = sphere(N, dim-1)
    # Randomize radii of the points on the sphere using power law to get a uniform distribution on the ball.
    radii = np.power(np.random.random(N), 1/dim)
    output = radii.reshape(-1, 1) * sphere_points
    return output

N = 5000
dim = 5
r_cutoff = 0.1
# Generate a sample to test
sample = ball(N, dim)
# Construct a KD Tree for the sample
sample_kdt = kdtree(sample)

# pdist method for distance matrix
tic = time.monotonic()
pdist(sample)
toc = time.monotonic()
print(f"Time taken from pdist = {toc-tic}")

# KD Tree method for distance matrix
tic = time.monotonic()
sample_kdt.sparse_distance_matrix(sample_kdt, r_cutoff)
toc = time.monotonic()
print(f"Time taken from the KDTree method = {toc-tic}")
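As a point of comparison for the KD-tree timing, the following sketch makes the same call through scipy.spatial.cKDTree and requests COO output instead of the default dictionary-of-keys matrix. As far as I know, KDTree was a pure-Python implementation in SciPy releases before 1.6 and cKDTree was the compiled one, so results may depend heavily on the installed version. The data here is a stand-in (uniform points in a cube), not the ball sample from the script, and no timing is claimed.

```python
import time

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# Stand-in data: 5000 points in the cube [-1, 1]^5 (an assumption for this sketch).
points = rng.uniform(-1.0, 1.0, size=(5000, 5))
r_cutoff = 0.1

tree = cKDTree(points)

tic = time.monotonic()
# Ask for a COO matrix rather than the default dictionary-of-keys matrix.
sparse = tree.sparse_distance_matrix(tree, r_cutoff, output_type="coo_matrix")
toc = time.monotonic()

print(f"Time taken from cKDTree = {toc - tic}")
print(f"Stored entries: {sparse.nnz}")
```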

1 answer:

Answer 0 (score: 1)

import time
import numpy as np
from sklearn.neighbors import BallTree

# `sample` is the array of points generated in the question.

tic = time.monotonic()

tree = BallTree(sample, leaf_size=10)       
d,i = tree.query(sample, k=1)

toc = time.monotonic()

print(f"Time taken from Sklearn BallTree = {toc-tic}")

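On my machine this gave Time taken from Sklearn BallTree = 0.30803330009803176, while pdist took a little over one second. Note: I have some heavy computation running that is using 3/4 of the cores on my machine.

That query returns the nearest neighbour (k=1).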
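One caveat worth illustrating: when the query points are the same points the tree was built from, the nearest neighbour of each point is the point itself, at distance zero. A minimal sketch of asking for k=2 and taking the second column instead follows. It uses stand-in random data rather than the sample from the question, and it assumes there are no duplicate points.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
# Stand-in data: 1000 points in the cube [-1, 1]^5 (an assumption for this sketch).
points = rng.uniform(-1.0, 1.0, size=(1000, 5))

tree = BallTree(points, leaf_size=10)

# k=1 on the tree's own points: each point finds itself at distance 0.
d_self, i_self = tree.query(points, k=1)

# k=2: column 0 is the point itself, column 1 is its nearest other point.
d, i = tree.query(points, k=2)
nearest_dist = d[:, 1]
nearest_idx = i[:, 1]
```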


For a radius of 0.1:

import time
import numpy as np
from sklearn.neighbors import BallTree

# `sample` is the array of points generated in the question.

tic = time.monotonic()

tree = BallTree(sample, leaf_size=10)       
i = tree.query_radius(sample, r=0.1)

toc = time.monotonic()

print(f"Time taken from Sklearn BallTree Radius = {toc-tic}")

This is fast:

Time taken from Sklearn BallTree Radius = 0.11115029989741743
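The radius query above returns only neighbour indices. If a sparse distance matrix is the goal, one possible approach is to request the distances as well and pack the ragged result into a SciPy CSR matrix. This is a sketch under assumptions: the data is a stand-in, the cutoff of 0.5 is chosen only so that the example has some neighbours, and the packing into CSR is my own illustration, not part of the answer's method.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
# Stand-in data: 2000 points in the cube [-1, 1]^5 (an assumption for this sketch).
points = rng.uniform(-1.0, 1.0, size=(2000, 5))
r_cutoff = 0.5

tree = BallTree(points, leaf_size=10)
# With return_distance=True, query_radius returns (indices, distances);
# each is an object array holding one variable-length array per query point.
ind, dist = tree.query_radius(points, r=r_cutoff, return_distance=True)

# Flatten the ragged per-point results into CSR components.
counts = np.array([len(row) for row in ind])
indptr = np.concatenate(([0], np.cumsum(counts)))
indices = np.concatenate(ind)
data = np.concatenate(dist)

n = len(points)
sparse_dist = csr_matrix((data, indices, indptr), shape=(n, n))
sparse_dist.sort_indices()  # column order within a row is not guaranteed by the query
```

In this representation a missing entry means "farther apart than the cutoff", not "distance zero"; the zero self-distances on the diagonal are stored explicitly.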