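Using CountVectorizer

Time: 2018-09-05 22:55:37

Tags: python machine-learning scikit-learn xgboost

I get a memory error when calling todense(); my code is below. I am using a GBDT model. Does anyone have a good idea for how to resolve the memory error? Thanks.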

  for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()
  y_train = y_train.astype('int')
  grd = GradientBoostingClassifier(n_estimators=n_estimator, max_depth=10)
  grd.fit(X_train.values, y_train.values)

Detailed error message:

in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
...

Regards, Lin

1 answer:

Answer 0 (score: 1)

There are several mistakes here:

for feature_colunm_name in feature_columns_to_use:
    X_train[feature_colunm_name] = CountVectorizer().fit_transform(X_train[feature_colunm_name]).todense()
    X_test[feature_colunm_name] = CountVectorizer().fit_transform(X_test[feature_colunm_name]).todense()

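1) You are trying to assign multiple columns (the output of CountVectorizer is a two-dimensional array in which each column represents a feature) to a single DataFrame column, 'feature_colunm_name'. That will not work and will raise an error.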
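To make point 1 concrete, here is a minimal sketch of what CountVectorizer returns for one text column. The toy documents are made up for illustration; the point is only that the result is a sparse matrix with one column per distinct token, not a single column.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for one text column (hypothetical data).
docs = ["red apple", "green apple", "red red cherry"]

cv = CountVectorizer()
counts = cv.fit_transform(docs)

# One row per document, one column per distinct token in the vocabulary.
n_rows, n_cols = counts.shape
vocabulary = sorted(cv.vocabulary_)

print(issparse(counts), n_rows, n_cols)
print(vocabulary)
```

Because the result has as many columns as there are distinct tokens, it cannot be stored back into one DataFrame column.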

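2) You are fitting CountVectorizer again on the test data, which is wrong. You should reuse the same CountVectorizer object that was fitted on the training data, and call only transform() on the test data, not fit_transform().

Something like this: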

cv = CountVectorizer()
X_train_cv = cv.fit_transform(X_train[feature_colunm_name])
X_test_cv = cv.transform(X_test[feature_colunm_name])
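As a rough illustration of why this matters, the sketch below compares reusing the fitted vectorizer against fitting a second one on the test data. The documents are invented for the example. With a separate fit the test matrix gets its own vocabulary, so its columns no longer line up with the training columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data for illustration only.
train_docs = ["red apple", "green apple", "red cherry"]
test_docs = ["green cherry", "blue plum pie"]

cv = CountVectorizer()
X_train_cv = cv.fit_transform(train_docs)
# Reuse the fitted vocabulary; tokens unseen in training are ignored.
X_test_cv = cv.transform(test_docs)

# Fitting a new vectorizer on the test data builds a different vocabulary.
refit = CountVectorizer().fit_transform(test_docs)

print(X_train_cv.shape, X_test_cv.shape, refit.shape)
```

The second test document contains only unseen tokens, so its row is all zeros after transform(), while the refitted matrix has a different number of columns altogether.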

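3) GradientBoostingClassifier works with sparse data, so you do not need to call todense() at all. This is not yet mentioned in the documentation (that looks like an error in the documentation).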
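A small sketch of point 3, assuming a reasonably recent scikit-learn: the sparse matrix from CountVectorizer is passed straight to fit() and predict() without any dense conversion. The documents, labels, and hyperparameters here are placeholders chosen to keep the example tiny.

```python
from scipy.sparse import issparse
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: six short documents with binary labels.
docs = ["good great", "great fine", "bad awful",
        "awful poor", "good fine", "bad poor"]
labels = [1, 1, 0, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)  # stays sparse

clf = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
clf.fit(X, labels)        # no todense() / toarray() needed
preds = clf.predict(X)

print(issparse(X), preds.shape)
```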

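4) You appear to be converting multiple columns of the original data into bag-of-words form. To do that, you need one CountVectorizer object per column, then merge all the outputs into a single array and pass that to GradientBoostingClassifier.

Update

You need to set up something like the following: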

# To merge sparse matrices
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

result_matrix_train = None
result_matrix_test = None

for feature_colunm_name in feature_columns_to_use:
    # One vectorizer per column, fitted on the training data only
    cv = CountVectorizer()
    X_train_cv = cv.fit_transform(X_train[feature_colunm_name])

    # Merge the vector with others
    if result_matrix_train is None:
        result_matrix_train = X_train_cv
    else:
        result_matrix_train = hstack((result_matrix_train, X_train_cv))

    # Now transform the test data with the same vectorizer
    X_test_cv = cv.transform(X_test[feature_colunm_name])
    if result_matrix_test is None:
        result_matrix_test = X_test_cv
    else:
        result_matrix_test = hstack((result_matrix_test, X_test_cv))

Note: if there are other columns that did not go through CountVectorizer (because they are already numeric) and you want to merge them with result_matrix_train, you can do so like this:

result_matrix_train = hstack((result_matrix_train, X_train[other_columns].values))
result_matrix_test = hstack((result_matrix_test, X_test[other_columns].values))
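Here is a standalone sketch of that merge with made-up numbers. The arrays stand in for the merged CountVectorizer output and for X_train[other_columns].values. Wrapping the dense block in csr_matrix first is my own precaution rather than something the step requires, and format="csr" just picks a row-oriented result.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, issparse

# Stand-in for the merged CountVectorizer output (hypothetical values).
text_features = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
# Stand-in for X_train[other_columns].values (hypothetical values).
numeric_features = np.array([[0.5, 10.0], [1.5, 20.0]])

# Append the numeric columns to the right of the bag-of-words columns.
merged = hstack([text_features, csr_matrix(numeric_features)], format="csr")

print(issparse(merged), merged.shape)
```

The result is still sparse, so the text columns never get expanded into a dense array.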

Now use these to train:

...
grd.fit(result_matrix_train, y_train.values)
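Putting the pieces together, the following is a minimal end-to-end sketch on a toy DataFrame. The column names, data, and hyperparameters are all invented for illustration, and the blocks are collected in lists and stacked once at the end, which is a small variation on the loop above rather than a different technique.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data with two text columns.
X_train = pd.DataFrame({
    "title": ["cheap pills", "team meeting", "cheap watches", "project update"],
    "body": ["buy now", "agenda attached", "buy today", "status attached"],
})
y_train = pd.Series([1, 0, 1, 0])
X_test = pd.DataFrame({"title": ["cheap offer"], "body": ["buy now please"]})

feature_columns_to_use = ["title", "body"]
train_blocks, test_blocks = [], []
for name in feature_columns_to_use:
    cv = CountVectorizer()                          # one vectorizer per column
    train_blocks.append(cv.fit_transform(X_train[name]))
    test_blocks.append(cv.transform(X_test[name]))  # reuse, do not refit

# Stack the per-column matrices side by side; everything stays sparse.
result_matrix_train = hstack(train_blocks, format="csr")
result_matrix_test = hstack(test_blocks, format="csr")

grd = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=0)
grd.fit(result_matrix_train, y_train.values)
predictions = grd.predict(result_matrix_test)

print(result_matrix_train.shape, result_matrix_test.shape, predictions.shape)
```

Train and test end up with the same number of columns because each test block comes from the vectorizer fitted on the matching training column.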