Memory error when training a model with AgglomerativeClustering

Time: 2020-03-04 08:55:05

Tags: python python-3.x numpy scikit-learn hierarchical-clustering

I am training an unsupervised learning model. The dataset has roughly 140,000 rows and 6 columns, stored as a 10,637 KB CSV file.

import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib qt
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans 
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import AgglomerativeClustering

The libraries above are imported.

Rev =  pd.read_csv(r"Updated_Rev.csv")
labelEncoder = LabelEncoder()
labelEncoder.fit(Rev["Technology"])
Rev["Technology"] = labelEncoder.transform(Rev["Technology"])

One column holds strings, so it is label-encoded, although it may not be needed for training later on.

train = Rev.iloc[:,:4]
clustering = AgglomerativeClustering()
clustering.fit(train)

This is the training data: all rows are used, with the first 4 columns selected. Fitting the model fails with this error:

MemoryError: Unable to allocate 67.9 GiB for an array with shape (9117833280,) and data type float64
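
The size in this message can be checked by hand. The traceback shows `pdist` allocating `m * (m - 1) // 2` doubles, one per pair of rows, so the reported shape gives back the row count. A minimal sketch of that arithmetic follows; the helper names `condensed_size` and `rows_from_condensed` are hypothetical, introduced only for illustration.

```python
import math


def condensed_size(m: int) -> int:
    """Number of pairwise distances stored for m rows: m * (m - 1) / 2."""
    return m * (m - 1) // 2


def rows_from_condensed(n_entries: int) -> int:
    """Invert n = m * (m - 1) / 2 to recover the row count m.

    Uses (2m - 1)^2 = 8n + 1, so m = (1 + sqrt(8n + 1)) / 2.
    """
    return (1 + math.isqrt(1 + 8 * n_entries)) // 2


n_entries = 9_117_833_280          # shape reported in the MemoryError
m = rows_from_condensed(n_entries)  # rows actually passed to fit()
gib = n_entries * 8 / 2**30         # float64 takes 8 bytes per entry

print(m, round(gib, 1))  # 135040 67.9
```

So the request grows with the square of the row count and has nothing to do with the 10 MB size of the CSV file itself.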

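Points to note:

  1. I do not have administrator access on this system.
  2. The machine runs Windows and uses Anaconda's base environment.
  3. Anaconda is installed for a single user only.

Full traceback: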
MemoryError                               Traceback (most recent call last)
<ipython-input-16-0f4f354e9aaf> in <module>
      1 train = Rev.iloc[:,:4]
----> 2 clustering.fit(train)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py in fit(self, X, y)
    857                                          n_clusters=n_clusters,
    858                                          return_distance=return_distance,
--> 859                                          **kwargs)
    860         (self.children_,
    861          self.n_connected_components_,

~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    353 
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356 
    357     def call_and_shelve(self, *args, **kwargs):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py in ward_tree(X, connectivity, n_clusters, return_distance)
    232                           stacklevel=2)
    233         X = np.require(X, requirements="W")
--> 234         out = hierarchy.ward(X)
    235         children_ = out[:, :2].astype(np.intp)
    236 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in ward(y)
    828 
    829     """
--> 830     return linkage(y, method='ward', metric='euclidean')
    831 
    832 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in linkage(y, method, metric, optimal_ordering)
   1054                          'matrix looks suspiciously like an uncondensed '
   1055                          'distance matrix')
-> 1056         y = distance.pdist(y, metric)
   1057     else:
   1058         raise ValueError("`y` must be 1 or 2 dimensional.")

~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py in pdist(X, metric, *args, **kwargs)
   2002     out = kwargs.pop("out", None)
   2003     if out is None:
-> 2004         dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
   2005     else:
   2006         if out.shape != (m * (m - 1) // 2,):

MemoryError: Unable to allocate 67.9 GiB for an array with shape (9117833280,) and data type float64
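
The traceback goes through the unstructured branch of `ward_tree`, which hands the data to SciPy and builds the full pairwise distance array. One possible way around that, offered as a sketch rather than a confirmed fix for this dataset, is to pass a sparse `connectivity` graph so that scikit-learn uses its structured code path instead. The function name `cluster_with_knn_connectivity`, the synthetic data, and the values `n_clusters=2` and `n_neighbors=10` are assumptions for illustration; the resulting clusters are constrained by the neighbour graph and may differ from unconstrained Ward clustering.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph


def cluster_with_knn_connectivity(X, n_clusters=2, n_neighbors=10):
    """Ward clustering restricted to a k-nearest-neighbour graph.

    The graph is sparse (about n_samples * n_neighbors entries),
    so no dense n_samples x n_samples distance array is requested.
    """
    connectivity = kneighbors_graph(
        X, n_neighbors=n_neighbors, include_self=False
    )
    model = AgglomerativeClustering(
        n_clusters=n_clusters,
        linkage="ward",
        connectivity=connectivity,
    )
    return model.fit_predict(X)


# Synthetic stand-in for the real data: two well-separated blobs, 4 columns.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(200, 4)),
    rng.normal(20.0, 0.5, size=(200, 4)),
])

labels = cluster_with_knn_connectivity(X_demo)
print(np.bincount(labels))
```

With the question's data this would be called as `cluster_with_knn_connectivity(train.to_numpy())`. scikit-learn may warn that the neighbour graph has several connected components and complete it automatically; a larger `n_neighbors` makes that less likely at the cost of more memory.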

0 answers:

No answers.