I have two numpy arrays (samples vs clusters):
data (n_samples, n_features)
clusters (n_clusters, n_features)
The goal is to compute a numpy array that holds, for each sample, the index of the closest cluster:
new_assignments (n_samples)
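For concreteness, a tiny made-up example (three samples, two clusters, two features); each entry of new_assignments is the row index in clusters that lies closest to the corresponding row of data:

import numpy as np

data = np.array([[0.0, 1.5], [9.0, 4.0], [0.5, 0.5]])   # (n_samples, n_features) = (3, 2)
clusters = np.array([[0.0, 1.0], [10.0, 5.0]])          # (n_clusters, n_features) = (2, 2)
# expected result: new_assignments == np.array([0, 1, 0])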
The current code is:
def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def get_index_from_euclidean_distances(clusters, sample):
        e_distances = np.sqrt(np.sum(np.power(np.subtract(clusters, sample), 2), axis=1))
        # return index with the minimal distance
        return np.where(e_distances == np.min(e_distances))[0]
    new_assignments = np.empty((0, 1), int)
    # iterate through all samples
    for i in range(data.shape[0]):
        new_assignments = np.append(new_assignments, get_index_from_euclidean_distances(clusters, data[i]))
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
But it is very slow. How can I make this faster? Is there a better way to approach the problem?
Edit:
The code above is the core of a k-means clustering algorithm and accounts for 99.9% of its execution time. I am building it from scratch for educational purposes, and your answers gave me what I needed. (Sorry for the earlier edits and the confusion; this is my first question. Future questions will be more specific and will include all the information and data needed for debugging and reproducing the problem.)
Thank you Sobek: switching to np.apply_along_axis already improved the performance compared with the original version.
I will keep building on the solution Eli Korvigo suggested.
Thank you all very much!
Answer 0 (score: 1)
Edit
Let's assume you have a set of C centroid points (clusters) in an N-dimensional vector space with the Euclidean metric, and a set of Q query points (samples). Now, if you want to find the closest centroid for each query point, you can do it in roughly O(Q log C) using a search tree (e.g. a K-D tree), whereas the current brute-force approach is O(QC):
In [1]: import numpy as np
In [2]: from sklearn.neighbors import DistanceMetric, KDTree
In [3]: clusters = np.array([
   ...:     [0, 1],
   ...:     [10, 5]
   ...: ])
In [4]: tree = KDTree(clusters, metric=DistanceMetric.get_metric('euclidean'))
In [5]: samples = np.array([
   ...:     [0, 2],
   ...:     [10, 6]
   ...: ])
In [6]: tree.query(samples, return_distance=False)
Out[6]:
array([[0],
       [1]])
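As a usage note (my addition, not part of the original answer): KDTree.query with return_distance=False returns an index array of shape (n_samples, 1), so flattening it gives exactly the new_assignments array the question asks for. find_difference is assumed to be the OP's existing helper:

import numpy as np
from sklearn.neighbors import DistanceMetric, KDTree

def assign_clusters_to_samples_kdtree(data, clusters, assignments):
    # build a K-D tree over the cluster centres
    tree = KDTree(clusters, metric=DistanceMetric.get_metric('euclidean'))
    # query all samples at once and ravel the (n_samples, 1) index array
    new_assignments = tree.query(data, return_distance=False).ravel()
    return new_assignments, find_difference(new_assignments, assignments)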
Original answer (including the postscript)

I see an np.append call inside the loop, which is usually considered a red flag for poorly optimised code, because NumPy arrays are not dynamic: np.append has to reallocate and copy its operands on every iteration. You are better off accumulating the arrays in a list and calling np.concatenate on the resulting list once:
def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def euclidean_distances(clusters, sample):
        e_distances = np.sqrt(np.sum(np.power(np.subtract(clusters, sample), 2), axis=1))
        # return index with the minimal distance
        return np.where(e_distances == np.min(e_distances))[0]
    # iterate through all samples
    acc = [euclidean_distances(clusters, data[i]).flatten() for i in range(data.shape[0])]
    new_assignments = np.concatenate(acc)
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
P.S.

Your function (and, by extension, my solution) flattens the return value of euclidean_distances before appending/concatenating, because you call np.append without the axis argument (after all, the original new_assignments object is clearly non-flat). From a design standpoint, I don't think you should call find_difference inside this function at all; a cleaner solution keeps that comparison outside.
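A minimal sketch of that separation (my own illustration of the answer's suggestion, not code from the answer itself; find_difference keeps its meaning from the question and stays with the caller):

import numpy as np

def assign_clusters_to_samples(data, clusters):
    # index of the closest cluster for a single sample; the sqrt is dropped
    # because it does not change which index is the minimum
    def nearest_cluster(sample):
        return np.argmin(np.sum((clusters - sample) ** 2, axis=1))
    # one index per sample, no np.append and no concatenation
    return np.array([nearest_cluster(sample) for sample in data])

# the caller decides whether anything changed:
# new_assignments = assign_clusters_to_samples(data, clusters)
# changed = find_difference(new_assignments, old_assignments)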
Answer 1 (score: 1)
euclidean_distances is hard to read because you use numpy methods instead of the mathematical operators. Using numpy.append is very slow, because the whole array has to be copied every time.
def assign_clusters_to_samples(data, clusters, assignments):
    # clusters-array of clusters, sample-single sample from the database
    def euclidean_distances(clusters, sample):
        # squared distances are enough here: the sqrt does not change the argmin
        e_distances = np.sum((clusters - sample)**2, axis=1)
        # return index with the minimal distance
        return np.argmin(e_distances)
    new_assignments = [
        euclidean_distances(clusters, d)
        for d in data
    ]
    # return new assignments and True if there is a difference to last assignments, False otherwise
    return new_assignments, find_difference(new_assignments, assignments)
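For completeness, a fully vectorised variant of the same idea (my addition, not part of either answer): broadcasting removes the per-sample Python loop entirely and returns the same indices as the list comprehension above.

import numpy as np

def assign_clusters_vectorised(data, clusters):
    # (n_samples, 1, n_features) minus (1, n_clusters, n_features) broadcasts
    # to (n_samples, n_clusters, n_features); summing over the feature axis
    # gives the squared distance of every sample to every cluster
    sq_distances = np.sum((data[:, None, :] - clusters[None, :, :]) ** 2, axis=-1)
    # index of the nearest cluster for each sample
    return np.argmin(sq_distances, axis=1)

Note that this materialises an (n_samples, n_clusters, n_features) intermediate array, so for very large inputs the KDTree approach from the first answer may still be the better choice.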