How to speed up this inter-array process? [Python, NumPy]

Date: 2018-09-30 12:27:52

Tags: python numpy

Given two numpy arrays (samples and clusters):

data(n_samples, n_features)
clusters(n_clusters, n_features)

The goal is to compute a numpy array holding, for each sample, the index of its nearest cluster:

new_assignments(n_samples)

The current code looks like this:

def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def get_index_from_euclidean_distances(clusters, sample):
        e_distances = np.sqrt(np.sum(np.power(np.subtract(clusters,sample),2), axis=1))
        # return index with the minimal distance
        return np.where(e_distances==np.min(e_distances))[0]

    new_assignments = np.empty((0,1), int)
    # iterate through all samples
    for i in range(data.shape[0]):
        new_assignments = np.append(new_assignments, get_index_from_euclidean_distances(clusters,data[i]))
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
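For concreteness, the function above can be exercised with a small driver. The question never shows `find_difference`, so a trivial stand-in is assumed here based on the comment ("True if there is a difference to last assignments"):

```python
import numpy as np

def find_difference(new_assignments, assignments):
    # assumed stand-in for the question's unspecified helper:
    # True if the assignments changed, False otherwise
    return not np.array_equal(new_assignments, assignments)

def assign_clusters_to_samples(data, clusters, assignments):
    # clusters - array of clusters, sample - a single sample from the database
    def get_index_from_euclidean_distances(clusters, sample):
        e_distances = np.sqrt(np.sum(np.power(np.subtract(clusters, sample), 2), axis=1))
        # return index with the minimal distance
        return np.where(e_distances == np.min(e_distances))[0]

    new_assignments = np.empty((0, 1), int)
    # iterate through all samples, appending one index per sample
    for i in range(data.shape[0]):
        new_assignments = np.append(new_assignments, get_index_from_euclidean_distances(clusters, data[i]))
    return new_assignments, find_difference(new_assignments, assignments)

data = np.array([[0.0, 2.0], [10.0, 6.0], [9.0, 5.0]])
clusters = np.array([[0.0, 1.0], [10.0, 5.0]])
new_assignments, changed = assign_clusters_to_samples(data, clusters, np.array([0, 0, 0]))
print(new_assignments)  # [0 1 1]
print(changed)          # True
```

The toy data makes the expected behaviour easy to check by hand: the first sample sits next to the first cluster, the other two next to the second.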

But it is very slow. How can I speed it up? Is there a better way to approach the problem?

Edit:

The code above is the core of a k-means clustering algorithm and accounts for 99.9% of the execution time. I am building it from scratch for educational purposes, and the answers gave me what I needed. (Sorry for the earlier edits and the confusion; this was my first question. Future questions will be more specific and will include all the information and data needed for debugging and reproducing the problem.)

Thanks, Sobek: moving from the original loop to np.apply_along_axis improved performance.

I will continue building on the solution Eli Korvigo suggested.

Thank you very much!

2 answers:

Answer 0 (score: 1)

Edit

Let's assume you have a collection of C centroid points (clusters) in an N-dimensional vector space with a Euclidean metric, and a set of Q query points (samples). If you now want to find the nearest centroid for every query point, you can do it in approximately O(QlogC) using a search tree (e.g. a K-D tree), while your current approach is O(Q**2):

    In [1]: import numpy as np

    In [2]: from sklearn.neighbors import DistanceMetric, KDTree

    In [3]: clusters = np.array([
       ...:     [0, 1],
       ...:     [10, 5]
       ...: ])

    In [4]: tree = KDTree(clusters, metric=DistanceMetric.get_metric('euclidean'))

    In [5]: samples = np.array([
       ...:     [0, 2],
       ...:     [10, 6]
       ...: ])

    In [6]: tree.query(samples, return_distance=False)
    Out[6]:
    array([[0],
           [1]])

Original answer (postscript included)

I see an np.append call inside the loop, which is usually considered a red flag in under-optimised code, because NumPy arrays are not dynamic: np.append has to reallocate and copy its operands on every iteration. You are better off accumulating the arrays in a list and calling np.concatenate on the resulting list:

    def assign_clusters_to_samples(data, clusters, assignments):
        # clusters - array of clusters, sample - a single sample from the database
        def euclidean_distances(clusters, sample):
            e_distances = np.sqrt(np.sum(np.power(np.subtract(clusters, sample), 2), axis=1))
            # return index with the minimal distance
            return np.where(e_distances == np.min(e_distances))[0]
        # iterate through all samples
        acc = [euclidean_distances(clusters, data[i]).flatten() for i in range(data.shape[0])]
        new_assignments = np.concatenate(acc)
        # return new assignments and True if there is a difference to last assignments, False otherwise
        return new_assignments, find_difference(new_assignments, assignments)

P.S.

  1. I'm not sure whether you call np.append without specifying axis on purpose (after all, the values returned from the inner function are apparently non-flat): the function (and, by extension, my solution) flattens them before appending/concatenating.
  2. Your algorithm is not particularly efficient. Any distance-search tree data structure would greatly improve the time complexity.
  3. Design-wise, I don't think you should call find_difference inside this function. In my view it is cleaner to return new_assignments on its own and let the caller compare it with the previous assignments.
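The last point above suggests separating cluster assignment from change detection. A minimal sketch of that split, as a reconstruction rather than the answer's exact code (the `find_difference` semantics are assumed from the question's comment):

```python
import numpy as np

def assign_clusters_to_samples(data, clusters):
    # assignment only: no knowledge of previous iterations needed here
    def nearest_cluster(sample):
        e_distances = np.sqrt(np.sum((clusters - sample) ** 2, axis=1))
        return int(np.argmin(e_distances))
    return np.array([nearest_cluster(s) for s in data])

def find_difference(new_assignments, assignments):
    # assumed helper: True if anything changed since the previous iteration
    return not np.array_equal(new_assignments, assignments)

# the caller decides what to do with the comparison
clusters = np.array([[0.0, 1.0], [10.0, 5.0]])
data = np.array([[0.0, 2.0], [10.0, 6.0]])
assignments = np.array([0, 0])
new_assignments = assign_clusters_to_samples(data, clusters)
changed = find_difference(new_assignments, assignments)
print(new_assignments, changed)  # [0 1] True
```

Keeping the comparison at the call site means the assignment function can be reused (or replaced by a KDTree query) without touching the convergence check.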

Answer 1 (score: 1)

Your distance computation is hard to read because you use numpy methods instead of mathematical operators. Using numpy.append is very slow, because the whole array has to be copied every time.

def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def euclidean_distances(clusters, sample):
        # squared distance is enough here: sqrt is monotonic, so argmin is unchanged
        e_distances = np.sum((clusters - sample)**2, axis=1)
        # return index with the minimal distance
        return np.argmin(e_distances)

    new_assignments = [
        euclidean_distances(clusters,d)
        for d in data
    ]
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
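Going one step further than this answer (my addition, not part of the original): the Python-level loop can be removed entirely with broadcasting, computing all pairwise squared distances at once and taking a single argmin per sample:

```python
import numpy as np

def assign_clusters_vectorized(data, clusters):
    # pairwise squared distances via broadcasting:
    # (n_samples, 1, n_features) - (1, n_clusters, n_features)
    # -> (n_samples, n_clusters, n_features)
    diff = data[:, None, :] - clusters[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)   # (n_samples, n_clusters)
    return np.argmin(sq_dist, axis=1)     # nearest cluster index per sample

data = np.array([[0.0, 2.0], [10.0, 6.0], [9.0, 5.0]])
clusters = np.array([[0.0, 1.0], [10.0, 5.0]])
print(assign_clusters_vectorized(data, clusters))  # [0 1 1]
```

This trades memory (an intermediate array of shape (n_samples, n_clusters, n_features)) for speed; for very large inputs, the KDTree approach from the other answer scales better.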