I built a neural network that worked with a small dataset of around 300,000 rows, with 2 categorical variables and 1 dependent variable, but I ran into memory errors when I increased it to 6.5 million rows. So I decided to modify the code, and I am getting closer, but now I am running into fit errors. I have 2 categorical variables and one column for the dependent variable of 1s and 0s (suspicious or not suspicious). To start, the dataset looks like this:
DBF2
ParentProcess ChildProcess Suspicious
0 C:\Program Files (x86)\Wireless AutoSwitch\wrl... ... 0
1 C:\Program Files (x86)\Wireless AutoSwitch\wrl... ... 0
2 C:\Windows\System32\svchost.exe ... 1
3 C:\Program Files (x86)\Wireless AutoSwitch\wrl... ... 0
4 C:\Program Files (x86)\Wireless AutoSwitch\wrl... ... 0
5 C:\Program Files (x86)\Wireless AutoSwitch\wrl... ... 0
Here is my code, followed by the errors it raises:
import pandas as pd
import numpy as np
import hashlib
import matplotlib.pyplot as plt
import timeit
X = DBF2.iloc[:, 0:2].values
y = DBF2.iloc[:, 2].values#.ravel()
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
onehotencoder = OneHotEncoder(categorical_features = [0,1])
X = onehotencoder.fit_transform(X)
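# Note: OneHotEncoder returns a scipy.sparse matrix by default (sparse=True),
# which is why X_train later prints as a compressed sparse row matrix.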
index_to_drop = [0, 2039]
to_keep = list(set(xrange(X.shape[1]))-set(index_to_drop))
X = X[:,to_keep]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
#ERROR
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/sklearn/base.py", line 517, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 590, in fit
return self.partial_fit(X, y)
File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 621, in partial_fit
"Cannot center sparse matrices: pass `with_mean=False` "
ValueError: Cannot center sparse matrices: pass `with_mean=False` instead. See docstring for motivation and alternatives.
X_test = sc.transform(X_test)
#ERROR
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/data.py", line 677, in transform
check_is_fitted(self, 'scale_')
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 768, in check_is_fitted
raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
In case it helps, I can print X_train and y_train:
X_train
<5621203x7043 sparse matrix of type '<type 'numpy.float64'>'
with 11242334 stored elements in Compressed Sparse Row format>
y_train
array([0, 0, 0, ..., 0, 0, 0])
Answer (score: 1)
X_train is a sparse matrix, which is well suited for working with large datasets like yours. The problem is that, as the documentation states:
with_mean : boolean, True by default
If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
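For a sense of scale, here is a rough back-of-the-envelope estimate using the X_train shape printed in the question (a sketch; the CSR overhead figure is approximate), showing why building the dense matrix is hopeless here:
rows, cols = 5621203, 7043        # X_train shape from the question
nnz = 11242334                    # stored (non-zero) elements

dense_bytes = rows * cols * 8     # float64 elements
print(dense_bytes / 1e9)          # ~316.7 GB if densified

sparse_bytes = nnz * (8 + 4)      # value + column index per CSR element, row pointers ignored
print(sparse_bytes / 1e6)         # ~135 MB as stored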
You can try passing with_mean=False:
sc = StandardScaler(with_mean=False)
X_train = sc.fit_transform(X_train)
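As a quick sanity check, here is a minimal sketch on toy data (not the real DBF2) showing that scaling by variance alone preserves sparsity:
from scipy.sparse import csr_matrix, issparse
from sklearn.preprocessing import StandardScaler

X_toy = csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])  # tiny stand-in for X_train
sc_toy = StandardScaler(with_mean=False)
X_toy_scaled = sc_toy.fit_transform(X_toy)                # no exception with with_mean=False
print(issparse(X_toy_scaled))                             # True: the result is still sparse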
In your original run, the following line failed because sc was still an unfitted StandardScaler object (its fit_transform call had raised an exception before the scaler was ever fitted):
X_test = sc.transform(X_test)
To be able to use the transform method, you first have to fit the StandardScaler to a dataset. If your intention is to fit the StandardScaler on the training set and use it to transform both the training set and the test set into the same space, you can do it as follows:
sc = StandardScaler(with_mean=False)
sc.fit(X_train)                   # fit on the training set only (fit returns the scaler itself)
X_train = sc.transform(X_train)   # scale the training set
X_test = sc.transform(X_test)     # reuse the same training-set scaling on the test set
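If this scaling feeds into a downstream estimator, wrapping both in a Pipeline enforces the same fit-on-train, transform-both discipline automatically. A minimal sketch, with SGDClassifier standing in as a hypothetical placeholder for your neural network:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier  # hypothetical stand-in for the neural network

pipe = Pipeline([
    ('scale', StandardScaler(with_mean=False)),
    ('clf', SGDClassifier()),
])
pipe.fit(X_train, y_train)     # the scaler is fitted on the training set only
y_pred = pipe.predict(X_test)  # the test set is scaled with the same scale_ before prediction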