Question

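I need a memory- and time-efficient way to compute distances between roughly 50000 points in 1 to 10 dimensions, in Python. The methods I have tried so far have not worked well. So far I have tried:

scipy.spatial.distance.pdist, which computes the full distance matrix
scipy.spatial.KDTree.sparse_distance_matrix, which computes a sparse distance matrix up to a threshold

To my surprise, sparse_distance_matrix performed very badly. My test case is 5000 points sampled uniformly from the unit 5-dimensional ball: pdist returns in 0.113 seconds, while sparse_distance_matrix takes 44.966 seconds with a maximum-distance cutoff of 0.1.

For now I would stick with pdist, but with 50000 points it would use a numpy array with 2.5 x 10^9 entries, and I am worried that it would overload the runtime (?) memory.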
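
To put a number on that concern, here is a rough back-of-the-envelope sketch. It assumes 8-byte float64 entries (which is what pdist returns); the helper name distance_matrix_gib is made up for illustration. pdist stores the condensed form with n(n-1)/2 entries, while the 2.5 x 10^9 figure corresponds to the full n x n square form.

```python
# Rough memory estimate for the distance matrix of n points,
# assuming 8-byte float64 entries (hypothetical helper, for illustration only).
def distance_matrix_gib(n, condensed=True, bytes_per_entry=8):
    # Condensed form (what pdist returns): one entry per unordered pair.
    # Full square form: n * n entries.
    entries = n * (n - 1) // 2 if condensed else n * n
    return entries, entries * bytes_per_entry / 2**30

for n in (5000, 50000):
    for condensed in (True, False):
        entries, gib = distance_matrix_gib(n, condensed)
        kind = "condensed (pdist)" if condensed else "full square"
        print(f"n={n:>6} {kind:<18} {entries:>13,d} entries  ~{gib:6.2f} GiB")
```

For 50000 points this comes to roughly 9 GiB for the condensed form and roughly 19 GiB for the full square matrix, so the worry seems justified on a typical workstation.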

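Does anyone know a better approach, or see an obvious mistake in my implementation? Thanks in advance!

Here is what is needed to reproduce the output in Python 3: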

import numpy as np
import math
import time
from scipy.spatial.distance import pdist
from scipy.spatial import KDTree as kdtree

# Generate a uniform sample of size N on the unit dim-dimensional sphere (which lives in dim+1 dimensions)
def sphere(N, dim):
    # Get a random sample of points from the (dim+1)-dim. Gaussian.
    output = np.random.multivariate_normal(mean=np.zeros(dim+1), cov=np.identity(dim+1), size=N)
    # Normalize output
    output = output / np.linalg.norm(output, axis=1).reshape(-1,1)
    return output

# Generate a uniform sample of size N on the unit dim-dimensional ball.
def ball(N, dim):
    # Populate the points on the unit sphere that is the boundary.
    sphere_points = sphere(N, dim-1)
    # Randomize radii of the points on the sphere using power law to get a uniform distribution on the ball.
    radii = np.power(np.random.random(N), 1/dim)
    output = radii.reshape(-1, 1) * sphere_points
    return output

N = 5000
dim = 5
r_cutoff = 0.1
# Generate a sample to test
sample = ball(N, dim)
# Construct a KD Tree for the sample
sample_kdt = kdtree(sample)

# pdist method for distance matrix
tic = time.monotonic()
pdist(sample)
toc = time.monotonic()
print(f"Time taken from pdist = {toc-tic}")

# KD Tree method for distance matrix
tic = time.monotonic()
sample_kdt.sparse_distance_matrix(sample_kdt, r_cutoff)
toc = time.monotonic()
print(f"Time taken from the KDTree method = {toc-tic}")

Answer 1

import time
import numpy as np
from sklearn.neighbors import BallTree

# `sample` is the array generated by the code in the question

tic = time.monotonic()

tree = BallTree(sample, leaf_size=10)       
d,i = tree.query(sample, k=1)

toc = time.monotonic()

print(f"Time taken from Sklearn BallTree = {toc-tic}")

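On my machine this gave Time taken from Sklearn BallTree = 0.30803330009803176, while pdist took a little over a second. Note: I am running some heavy computations that are using 3/4 of the cores on my machine, so these timings may be inflated.

The code above finds the nearest neighbor (k=1).

For a radius of 0.1: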
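
One caveat worth checking: when the tree is queried with the same points it was built from, the nearest neighbor of each point is the point itself, at distance 0. A minimal sketch of asking for k=2 and taking the second column instead is shown below. The random `points` array is a made-up stand-in for `sample`, and it assumes there are no duplicate points.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
points = rng.random((1000, 5))  # hypothetical stand-in for `sample`

tree = BallTree(points, leaf_size=10)

# k=2: results are sorted by distance, so column 0 is the query point
# itself (distance 0) and column 1 is its nearest *other* point.
d, i = tree.query(points, k=2)
nn_dist = d[:, 1]
nn_index = i[:, 1]

print(nn_dist[:5])
print(nn_index[:5])
```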

import time
import numpy as np
from sklearn.neighbors import BallTree

# `sample` is the array generated by the code in the question

tic = time.monotonic()

tree = BallTree(sample, leaf_size=10)       
i = tree.query_radius(sample, r=0.1)

toc = time.monotonic()

print(f"Time taken from Sklearn BallTree Radius = {toc-tic}")

This is fast:

Time taken from Sklearn BallTree Radius = 0.11115029989741743
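
Note that query_radius as called above returns only neighbor indices, not distances. If a sparse distance matrix is what is wanted, one possible sketch is to pass return_distance=True and assemble a scipy.sparse matrix from the result. The random `points` array is a made-up stand-in for `sample`, and dropping the self-pairs is my own choice, not something the question asks for.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
points = rng.random((2000, 5))  # hypothetical stand-in for `sample`
r = 0.1

tree = BallTree(points, leaf_size=10)

# With return_distance=True, query_radius returns (indices, distances);
# each is an array holding one variable-length array per query point.
ind, dist = tree.query_radius(points, r=r, return_distance=True)

n = len(points)
counts = np.array([len(neigh) for neigh in ind])
rows = np.repeat(np.arange(n), counts)  # row index for every stored pair
cols = np.concatenate(ind)
vals = np.concatenate(dist)

# Drop the self-pairs (each point is within r of itself, at distance 0).
mask = rows != cols
sparse_d = csr_matrix((vals[mask], (rows[mask], cols[mask])), shape=(n, n))

print(f"{sparse_d.nnz} stored distances out of {n * n} possible entries")
```

Memory then scales with the number of pairs within the cutoff rather than with n^2, which is the point of using a threshold in the first place.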
