Python: Calculating pairwise distances causes a memory error

Time: 2015-07-09 14:25:44

Tags: python memory numpy scipy cluster-analysis

I want to calculate the pairwise distances between 57832 vectors. Each vector has 200 dimensions. I am using pdist to compute the distances.

from scipy.spatial.distance import pdist
pairwise_distances = pdist(X, 'cosine')
# pdist is supposed to return a condensed 1-D numpy array with shape (57832*57831/2,),
# i.e. one entry per unordered pair of vectors.

However, this results in a memory error.

Traceback (most recent call last):
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/main.py", line 101, in <module>
    result_clustering = clf_clustering.getCVResult(shuffle)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 158, in getCVResult
    self.centroids_of_categories(X_train, y_train)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 103, in centroids_of_categories
    cat_centroids.append( self.dpc.centroids(X_in_this_cat) )
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 23, in centroids
    distance_dict, rho_dict = self.compute_distances_and_rhos(X)
  File "/home/munichong/git/DomainClassification/NameSuggestion@Verisign/classification_DMOZ/ClusteringBasedClassification.py", line 59, in compute_distances_and_rhos
    pairwise_distances = pdist(X, 'cosine')
  File "/usr/local/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1185, in pdist
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
MemoryError

My laptop has 16 GB of RAM. How can I solve this? Or is there a better approach?

1 answer:

Answer 0 (score: 2)

Matrix-based algorithms are prohibitively expensive on large data sets.

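The memory requirement is easy to estimate. Even when exploiting symmetry, many implementations max out at around 65000 instances. But even 64-bit implementations on large machines will give up eventually. A 1000000x1000000 matrix in double precision, exploiting symmetry, needs about 4 TB of RAM. For the 57832 vectors in the question, the condensed array that pdist allocates has 57832*57831/2 entries of 8 bytes each, which is about 13.4 GB, so it cannot realistically fit on a 16 GB laptop alongside everything else.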
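The estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch; the helper name `condensed_distance_bytes` is made up for illustration and is not part of scipy. It only counts the bytes of the condensed array that pdist allocates in the traceback (`np.zeros((m * (m - 1)) // 2, dtype=np.double)`), not any temporary buffers.

```python
def condensed_distance_bytes(n, itemsize=8):
    """Bytes needed to store the n*(n-1)/2 pairwise distances.

    itemsize=8 corresponds to double precision (np.double).
    """
    return n * (n - 1) // 2 * itemsize


# Sizes discussed in this thread: the question's data set, the ~65000
# limit mentioned in the answer, and the one-million example.
for n in (57832, 65536, 1000000):
    gigabytes = condensed_distance_bytes(n) / 1e9
    print(n, "instances:", round(gigabytes, 1), "GB")
```

For n = 57832 this comes out at roughly 13.4 GB, and for n = 1000000 at roughly 4 TB, matching the figures above.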

Use a better algorithm, one that does not need O(n^2) memory and runtime.
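If switching algorithms is not an option, one partial workaround is to never hold all distances at once. The sketch below is an assumption about what the downstream step might need, not the asker's actual code: it supposes that only a per-point summary is required (here, the number of neighbors within a hypothetical `cutoff`), computes cosine distances one block of rows at a time with plain numpy, and discards each block after use. The function name, `cutoff`, and `block_size` are all illustrative. Note that this only fixes the memory side: the runtime is still O(n^2), so it does not fully satisfy the advice above.

```python
import numpy as np


def neighbor_counts_cosine(X, cutoff, block_size=1024):
    """For each row of X, count the other rows within `cutoff` cosine distance.

    Only a (block_size x n) slab of distances exists at any one time,
    instead of all n*(n-1)/2 pairwise distances.
    """
    X = np.asarray(X, dtype=np.float64)
    # Normalize rows to unit length. Assumption: no all-zero rows,
    # because cosine distance is undefined for a zero vector.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = Xn.shape[0]
    counts = np.empty(n, dtype=np.int64)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # For unit-length rows, cosine distance = 1 - dot product.
        d = 1.0 - Xn[start:stop] @ Xn.T
        within = d < cutoff
        # Do not count a point as its own neighbor.
        within[np.arange(stop - start), np.arange(start, stop)] = False
        counts[start:stop] = within.sum(axis=1)
    return counts
```

With 57832 rows and `block_size=1024`, each distance slab holds 1024 x 57832 doubles, roughly half a gigabyte, which fits comfortably in 16 GB. Whether a block-wise summary like this is enough depends on what the clustering step actually does with the distances.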