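I get a memory error when calling todense(). My code and the error are below. I am using a GBDT model. Does anyone have a good idea for how to resolve the memory error? Thanks.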
for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
y_train = y_train.astype('int')
grd = GradientBoostingClassifier(n_estimators=n_estimator, max_depth=10)
grd.fit(X_train.values, y_train.values)
Detailed error message:
in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
...
Regards, Lin
Answer 0 (score: 1)
There are several things wrong here:
for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
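1) You are trying to assign multiple columns (the result of CountVectorizer is a two-dimensional array in which each column represents a feature) to a single DataFrame column, feature_colunm_name. That will not work and will raise an error.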
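2) You are fitting a new CountVectorizer on the test data, which is wrong. Use the same CountVectorizer object that was fitted on the training data, and call only transform() on the test data, not fit_transform(). Otherwise the two vocabularies differ and the train and test columns no longer line up.
Something like: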
cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train[feature_colunm_name])
X_test_cv = cv.transform(X_test[feature_colunm_name])
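3) GradientBoostingClassifier works with sparse data, so there is no need to call todense() at all. This is not mentioned in the documentation yet (it looks like an error in the documentation).
4) You appear to be converting several columns of the original data into bag-of-words form. For that you need one CountVectorizer object per column, then merge all of their outputs into a single matrix and pass that to GradientBoostingClassifier.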
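To illustrate points 2 and 3, here is a minimal sketch on made-up toy data (the strings, labels, and hyperparameters are assumptions for illustration, not taken from the question). It shows that a shared vectorizer yields the same number of columns for train and test, compares the size of the dense array that todense() would allocate with the bytes held by the sparse matrix, and fits the classifier on the sparse matrix directly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for a single text column.
train_text = ["red apple", "green apple", "red car", "blue car"]
test_text = ["green car", "yellow apple"]
y_train = np.array([0, 0, 1, 1])

cv = CountVectorizer()
X_train_cv = cv.fit_transform(train_text)  # learns the vocabulary from train
X_test_cv = cv.transform(test_text)        # reuses it; unseen words are ignored

# Both matrices have the same number of columns because the vocabulary is shared.
print(X_train_cv.shape, X_test_cv.shape)

# Size of the dense array todense() would allocate, versus the bytes
# actually stored by the CSR matrix (values + column indices + row pointers).
dense_bytes = X_train_cv.shape[0] * X_train_cv.shape[1] * X_train_cv.dtype.itemsize
sparse_bytes = (X_train_cv.data.nbytes
                + X_train_cv.indices.nbytes
                + X_train_cv.indptr.nbytes)
print(dense_bytes, sparse_bytes)

# Fit and predict on the sparse matrices; no todense() call is needed.
grd = GradientBoostingClassifier(n_estimators=5, max_depth=2)
grd.fit(X_train_cv, y_train)
pred = grd.predict(X_test_cv)
```

On a tiny example the two sizes are close, but the dense size grows with rows times vocabulary size, while the sparse size grows only with the number of non-zero counts.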
Update:
You need to set up something like this:
# To merge sparse matrices
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

result_matrix_train = None
result_matrix_test = None

for feature_colunm_name in feature_columns_to_use:
    # Fit one vectorizer per column, on the training data only
    cv = CountVectorizer()
    X_train_cv = cv.fit_transform(X_train[feature_colunm_name])

    # Merge the vector with others
    result_matrix_train = (hstack((result_matrix_train, X_train_cv))
                           if result_matrix_train is not None else X_train_cv)

    # Now transform the test data with the same fitted vectorizer
    X_test_cv = cv.transform(X_test[feature_colunm_name])
    result_matrix_test = (hstack((result_matrix_test, X_test_cv))
                          if result_matrix_test is not None else X_test_cv)
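Note: if you have other columns that are not processed through CountVectorizer (because they are already numeric) and you want to merge them with result_matrix_train, you can do that as well: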
result_matrix_train = hstack((result_matrix_train, X_train[other_columns].values))
result_matrix_test = hstack((result_matrix_test, X_test[other_columns].values))
Now use these to train:
...
grd.fit(result_matrix_train, y_train.values)
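As an alternative to the manual loop, scikit-learn's ColumnTransformer can apply one CountVectorizer per text column and stack the results for you. This is a different technique from the one above, sketched here on made-up data: the column names (title, body, price), the values, and the hyperparameters are all assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy frames; column names and values are made up.
X_train = pd.DataFrame({
    "title": ["red apple", "green apple", "red car", "blue car"],
    "body": ["sweet fruit", "sour fruit", "fast vehicle", "slow vehicle"],
    "price": [1.0, 2.0, 30.0, 25.0],
})
y_train = pd.Series([0, 0, 1, 1])
X_test = pd.DataFrame({
    "title": ["green car"],
    "body": ["fast fruit"],
    "price": [20.0],
})

text_columns = ["title", "body"]

# One CountVectorizer per text column. Selecting by a string (not a list)
# hands each vectorizer a 1-D column of documents. The remaining numeric
# column is passed through unchanged.
preprocess = ColumnTransformer(
    [(name, CountVectorizer(), name) for name in text_columns],
    remainder="passthrough",
    sparse_threshold=1.0,  # keep the stacked output sparse for this data
)

model = make_pipeline(
    preprocess,
    GradientBoostingClassifier(n_estimators=5, max_depth=2),
)
model.fit(X_train, y_train)   # vectorizers are fitted on the training data only
pred = model.predict(X_test)  # test data is only transformed

# Inspect the stacked feature matrix produced for the test rows.
features_test = model[0].transform(X_test)
print(features_test.shape)
```

Wrapping the preprocessing and the classifier in one pipeline also makes it harder to fit a vectorizer on the test data by accident.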