I want to compute the pairwise distances between 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.
from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist returns a condensed 1-D numpy array with shape (57832 * 57831 // 2,), one entry per unordered pair.
However, this raises a MemoryError.
Traceback (most recent call last):
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
    result_clustering = clf_clustering.getCVResult(shuffle)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
    self.centroids_of_categories(X_train, y_train)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
    cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
    distance_dict, rho_dict = self.compute_distances_and_rhos(X)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
    pairwise_distances = pdist(X, 'cosine')
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError
My laptop has 16 GB of RAM. How can I fix this? Or is there a better approach?
Answer 0 (score: 2)
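Running matrix-based algorithms on large data sets is prohibitive.
The memory requirements are easy to estimate. Even when exploiting symmetry, many implementations max out at around 65000 instances. And even 64-bit implementations on large machines will eventually give up: a 1000000x1000000 matrix in double precision, exploiting symmetry, needs 4 TB of RAM.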
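As a rough check of that estimate, the sketch below computes the size of the condensed array that pdist allocates, using the m * (m - 1) // 2 length shown in the traceback and 8 bytes per double. The helper name condensed_distance_bytes is made up for illustration. For the 57832 vectors in the question this comes to about 13.4 GB for the result alone, which is why the allocation fails on a 16 GB laptop.

```python
def condensed_distance_bytes(n_vectors, bytes_per_value=8):
    """Bytes needed to hold one distance per unordered pair of vectors.

    Hypothetical helper for illustration; 8 bytes matches np.double.
    """
    n_pairs = n_vectors * (n_vectors - 1) // 2
    return n_pairs * bytes_per_value


# Sizes for the question's data set, the ~65000 limit, and one million rows.
for n in (57832, 65536, 1000000):
    size_gb = condensed_distance_bytes(n) / 1e9
    print("%8d vectors -> %.1f GB" % (n, size_gb))
```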
Use a better algorithm, one that does not need O(n^2) memory and runtime.
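If the full distance matrix is not needed afterwards, one partial workaround is to compute distances one block of rows at a time and keep only a per-row summary. The sketch below is an assumption about what the caller needs, not the asker's actual code: it counts, for each vector, how many other vectors lie within a cosine-distance cutoff. The function name, cutoff, and block_size are hypothetical, and it assumes no all-zero rows. It only removes the O(n^2) memory; the runtime is still O(n^2), so it does not replace a genuinely better algorithm.

```python
import numpy as np


def neighbor_counts_in_blocks(X, cutoff, block_size=1024):
    """Count, for each row of X, the other rows within `cutoff` cosine distance.

    Only a (block_size, n) slice of distances exists at any one time.
    """
    X = np.asarray(X, dtype=np.float64)
    # After normalizing rows to unit length, cosine distance is 1 - dot product.
    # Assumption: no row of X is all zeros.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = unit.shape[0]
    counts = np.zeros(n, dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Distances from rows start..stop-1 to every row.
        block = 1.0 - unit[start:stop].dot(unit.T)
        within = block < cutoff
        # Do not count a row as its own neighbor.
        within[np.arange(stop - start), np.arange(start, stop)] = False
        counts[start:stop] = within.sum(axis=1)
    return counts
```

With 57832 rows and block_size=1024, each block of distances takes roughly 1024 * 57832 * 8 bytes, or about 0.47 GB, instead of the 13.4 GB needed for the full condensed array.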