I am training an unsupervised learning model. The dataset has roughly 140,000 (1,40,000) rows and 6 columns. The file is a CSV of 10,637 KB.
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib qt
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import AgglomerativeClustering
The libraries above are imported.
Rev = pd.read_csv(r"Updated_Rev.csv")
labelEncoder = LabelEncoder()
labelEncoder.fit(Rev["Technology"])
Rev["Technology"] = labelEncoder.transform(Rev["Technology"])
One column contains strings, so it is label-encoded, although it may not be needed for training later on.
train = Rev.iloc[:,:4]
clustering = AgglomerativeClustering()
clustering.fit(train)
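This is the training data: all rows are needed for training, and the first 4 columns are selected. Running this produces the following error: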
MemoryError: Unable to allocate 67.9 GiB for an array with shape (9117833280,) and data type float64
MemoryError Traceback (most recent call last)
<ipython-input-16-0f4f354e9aaf> in <module>
1 train = Rev.iloc[:,:4]
----> 2 clustering.fit(train)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py in fit(self, X, y)
857 n_clusters=n_clusters,
858 return_distance=return_distance,
--> 859 **kwargs)
860 (self.children_,
861 self.n_connected_components_,
~\AppData\Local\Continuum\anaconda3\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
353
354 def __call__(self, *args, **kwargs):
--> 355 return self.func(*args, **kwargs)
356
357 def call_and_shelve(self, *args, **kwargs):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\cluster\_agglomerative.py in ward_tree(X, connectivity, n_clusters, return_distance)
232 stacklevel=2)
233 X = np.require(X, requirements="W")
--> 234 out = hierarchy.ward(X)
235 children_ = out[:, :2].astype(np.intp)
236
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in ward(y)
828
829 """
--> 830 return linkage(y, method='ward', metric='euclidean')
831
832
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\cluster\hierarchy.py in linkage(y, method, metric, optimal_ordering)
1054 'matrix looks suspiciously like an uncondensed '
1055 'distance matrix')
-> 1056 y = distance.pdist(y, metric)
1057 else:
1058 raise ValueError("`y` must be 1 or 2 dimensional.")
~\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py in pdist(X, metric, *args, **kwargs)
2002 out = kwargs.pop("out", None)
2003 if out is None:
-> 2004 dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
2005 else:
2006 if out.shape != (m * (m - 1) // 2,):
MemoryError: Unable to allocate 67.9 GiB for an array with shape (9117833280,) and data type float64
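For reference, the size in the error message can be reproduced with a little arithmetic. The last traceback frame shows `pdist` allocating `(m * (m - 1)) // 2` float64 values, one per pair of rows. The sketch below is only an illustration (the helper names `rows_from_condensed_length` and `condensed_pdist_bytes` are made up here, not part of SciPy): it recovers the row count `m` from the array length in the error and converts that length to GiB.

```python
import math


def rows_from_condensed_length(n_entries: int) -> int:
    """Solve m * (m - 1) / 2 == n_entries for the number of rows m."""
    return (1 + math.isqrt(1 + 8 * n_entries)) // 2


def condensed_pdist_bytes(m: int, itemsize: int = 8) -> int:
    """Bytes needed for a condensed pairwise-distance array of m rows (float64 by default)."""
    return (m * (m - 1)) // 2 * itemsize


# Array length taken from the MemoryError message above.
n_entries = 9117833280

m = rows_from_condensed_length(n_entries)
gib = condensed_pdist_bytes(m) / 2**30

print(f"rows passed to fit(): {m}")
print(f"condensed distance array: {gib:.1f} GiB")
```

Under these assumptions the array length corresponds to 135,040 rows, and the memory requirement grows quadratically with the row count regardless of how many columns are selected.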