I have two numpy arrays (samples vs clusters):
data (n_samples, n_features)
clusters (n_clusters, n_features)
The goal is to compute a numpy array that holds, for each sample, the index of the closest cluster:
new_assignments (n_samples)
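For concreteness, a tiny made-up example (three samples, two clusters, two features); each entry of new_assignments is the row index in clusters that lies closest to the corresponding row of data:

import numpy as np

data = np.array([[0.0, 1.5], [9.0, 4.0], [0.5, 0.5]])   # (n_samples, n_features) = (3, 2)
clusters = np.array([[0.0, 1.0], [10.0, 5.0]])          # (n_clusters, n_features) = (2, 2)
# expected result: new_assignments == np.array([0, 1, 0])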
The current code is:
def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def get_index_from_euclidean_distances(clusters, sample):
        e_distances = np.sqrt(np.sum(np.power(np.subtract(clusters, sample), 2), axis=1))
        # return index with the minimal distance
        return np.where(e_distances == np.min(e_distances))[0]
    new_assignments = np.empty((0, 1), int)
    # iterate through all samples
    for i in range(data.shape[0]):
        new_assignments = np.append(new_assignments, get_index_from_euclidean_distances(clusters, data[i]))
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
But it is very slow. How can I make this faster? Is there a better way to approach the problem?
Edit:
The code above is the core of a k-means clustering algorithm and accounts for 99.9% of its execution time. I am building it from scratch for educational purposes, and your answers gave me what I needed. (Sorry for the earlier edits and the confusion; this is my first question. Future questions will be more specific and will include all the information and data needed for debugging and reproducing the problem.)
Thank you Sobek: switching to np.apply_along_axis already improved the performance compared with the original version.
I will keep building on the solution Eli Korvigo suggested.
Thank you all very much!
Answer 0 (score: 1)
Edit
Let's assume you have a set of C centroid points (clusters) in an N-dimensional vector space with the Euclidean metric, and a set of Q query points (samples). Now, if you want to find the closest centroid for each query point, you can do it in roughly O(Q log C) using a search tree (e.g. a K-D tree), whereas the current brute-force approach is O(QC):
In [1]: import numpy as np
In [2]: from sklearn.neighbors import DistanceMetric, KDTree
In [3]: clusters = np.array([
   ...:     [0, 1],
   ...:     [10, 5]
   ...: ])
In [4]: tree = KDTree(clusters, metric=DistanceMetric.get_metric('euclidean'))
In [5]: samples = np.array([
   ...:     [0, 2],
   ...:     [10, 6]
   ...: ])
In [6]: tree.query(samples, return_distance=False)
Out[6]:
array([[0],
       [1]])
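As a usage note (my addition, not part of the original answer): KDTree.query with return_distance=False returns an index array of shape (n_samples, 1), so flattening it gives exactly the new_assignments array the question asks for. find_difference is assumed to be the OP's existing helper:

import numpy as np
from sklearn.neighbors import DistanceMetric, KDTree

def assign_clusters_to_samples_kdtree(data, clusters, assignments):
    # build a K-D tree over the cluster centres
    tree = KDTree(clusters, metric=DistanceMetric.get_metric('euclidean'))
    # query all samples at once and ravel the (n_samples, 1) index array
    new_assignments = tree.query(data, return_distance=False).ravel()
    return new_assignments, find_difference(new_assignments, assignments)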
Original answer (including the postscript)

I see an np.append call inside the loop, which is usually considered a red flag for poorly optimised code, because NumPy arrays are not dynamic: np.append has to reallocate and copy its operands on every iteration. You are better off accumulating the arrays in a list and calling np.concatenate on the resulting list once:
def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def euclidean_distances(clusters, sample):
        e_distances = np.sqrt(np.sum(np.power(np.subtract(clusters, sample), 2), axis=1))
        # return index with the minimal distance
        return np.where(e_distances == np.min(e_distances))[0]
    # iterate through all samples
    acc = [euclidean_distances(clusters, data[i]).flatten() for i in range(data.shape[0])]
    new_assignments = np.concatenate(acc)
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
P.S.

Your function (and, by extension, my solution) flattens the return value of euclidean_distances before appending/concatenating, because you call np.append without the axis argument (after all, the original new_assignments object is clearly non-flat). From a design standpoint, I don't think you should call find_difference inside this function at all; a cleaner solution keeps that comparison outside.
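A minimal sketch of that separation (my own illustration of the answer's suggestion, not code from the answer itself; find_difference keeps its meaning from the question and stays with the caller):

import numpy as np

def assign_clusters_to_samples(data, clusters):
    # index of the closest cluster for a single sample; the sqrt is dropped
    # because it does not change which index is the minimum
    def nearest_cluster(sample):
        return np.argmin(np.sum((clusters - sample) ** 2, axis=1))
    # one index per sample, no np.append and no concatenation
    return np.array([nearest_cluster(sample) for sample in data])

# the caller decides whether anything changed:
# new_assignments = assign_clusters_to_samples(data, clusters)
# changed = find_difference(new_assignments, old_assignments)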
Answer 1 (score: 1)
euclidean_distances is hard to read because you use numpy methods instead of the mathematical operators. Using numpy.append is very slow, because the whole array has to be copied every time.
def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def euclidean_distances(clusters, sample):
        # squared distances are enough here: the sqrt does not change the argmin
        e_distances = np.sum((clusters - sample)**2, axis=1)
        # return index with the minimal distance
        return np.argmin(e_distances)
    new_assignments = [
        euclidean_distances(clusters, d)
        for d in data
    ]
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
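For completeness, a fully vectorised variant of the same idea (my addition, not part of either answer): broadcasting removes the per-sample Python loop entirely and returns the same indices as the list comprehension above.

import numpy as np

def assign_clusters_vectorised(data, clusters):
    # (n_samples, 1, n_features) minus (1, n_clusters, n_features) broadcasts
    # to (n_samples, n_clusters, n_features); summing over the feature axis
    # gives the squared distance of every sample to every cluster
    sq_distances = np.sum((data[:, None, :] - clusters[None, :, :]) ** 2, axis=-1)
    # index of the nearest cluster for each sample
    return np.argmin(sq_distances, axis=1)

Note that this materialises an (n_samples, n_clusters, n_features) intermediate array, so for very large inputs the KDTree approach from the first answer may still be the better choice.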