Question

我正在尝试在sklearn中运行GridSearchCV for Logistic Regression，代码给出了以下错误：

ValueError: X has 21 features per sample; expecting 19

训练和测试数据的形状是

X_train.shape
(891L, 21L)  

X_test.shape
(418L, 21L)

我用来运行GridSearchCV的代码是

from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
logistic = LogisticRegression()

parameters = [{'C' : [1.0, 10.0, 100.0, 1000.0],
               'fit_intercept' : ['True', 'False'],
               'intercept_scaling' : [0, 1, 10, 100, 1000],
               'class_weight' : ['auto'],
               'random_state' : [26],
               'tol' : [0.001, 0.01, 0.1, 1, 10, 100]
               }]

logistic = GridSearchCV(LogisticRegression(),
                        parameters,
                        cv=3,
                        refit=True,
                        verbose=1)

logistic = logistic.fit(X_train, y_train)
logit_pred = logistic.predict(X_test)

我得到的追溯是：

ValueError                                Traceback (most recent call last)
C:\Code\kaggle\titanic\titanic.py in <module>()
    351 
    352 
--> 353     logistic = logistic.fit(X_train, y_train)
    354 
    355     logit_pred = logistic.predict(X_test)

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in fit(self, X, y)
    594 
    595         """
--> 596         return self._fit(X, y, ParameterGrid(self.param_grid))
    597 
    598 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\grid_search.pyc in _fit(self, X, y, parameter_iterable)
    376                                     train, test, self.verbose, parameters,
    377                                     self.fit_params, return_parameters=True)
--> 378             for parameters in parameter_iterable
    379             for train, test in cv)
    380 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __call__(self, iterable)
    651             self._iterating = True
    652             for function, args, kwargs in iterable:
--> 653                 self.dispatch(function, args, kwargs)
    654 
    655             if pre_dispatch == "all" or n_jobs == 1:

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in dispatch(self, func, args, kwargs)
    398         """
    399         if self._pool is None:
--> 400             job = ImmediateApply(func, args, kwargs)
    401             index = len(self._jobs)
    402             if not _verbosity_filter(index, self.verbose):

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\externals\joblib\parallel.pyc in __init__(self, func, args, kwargs)
    136         # Don't delay the application, to avoid keeping the input
    137         # arguments in memory
--> 138         self.results = func(*args, **kwargs)
    139 
    140     def get(self):

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters)
   1238     else:
   1239         estimator.fit(X_train, y_train, **fit_params)
-> 1240     test_score = _score(estimator, X_test, y_test, scorer)
   1241     if return_train_score:
   1242         train_score = _score(estimator, X_train, y_train, scorer)

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\cross_validation.pyc in _score(estimator, X_test, y_test, scorer)
   1294         score = scorer(estimator, X_test)
   1295     else:
-> 1296         score = scorer(estimator, X_test, y_test)
   1297     if not isinstance(score, numbers.Number):
   1298         raise ValueError("scoring must return a number, got %s (%s) instead."

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\metrics\scorer.pyc in _passthrough_scorer(estimator, *args, **kwargs)
    174 def _passthrough_scorer(estimator, *args, **kwargs):
    175     """Function that wraps estimator.score"""
--> 176     return estimator.score(*args, **kwargs)
    177 
    178 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\base.pyc in score(self, X, y, sample_weight)
    289         """
    290         from .metrics import accuracy_score
--> 291         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    292 
    293 

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in predict(self, X)
    213             Predicted class label per sample.
    214         """
--> 215         scores = self.decision_function(X)
    216         if len(scores.shape) == 1:
    217             indices = (scores > 0).astype(np.int)

C:\Users\User\AppData\Local\Enthought\Canopy\User\lib\site-packages\sklearn\linear_model\base.pyc in decision_function(self, X)
    194         if X.shape[1] != n_features:
    195             raise ValueError("X has %d features per sample; expecting %d"
--> 196                              % (X.shape[1], n_features))
    197 
    198         scores = safe_sparse_dot(X, self.coef_.T,

ValueError: X has 21 features per sample; expecting 19

为什么GridSearchCV期望的功能数量与数据集不同？

更新：

感谢Andy的回应。数据集都是numpy.ndarray类型，dtype是float64。

type(X_Train)    type(y_train)    type(X_test)
numpy.ndarray    numpy.ndarray    numpy.ndarray

在我将它们带入sklearn之前的步骤：

train_data = traindf.values
test_data = testdf.values

X_train = train_data[0::, 1::]  # training features
y_train = train_data[0::, 0]    # training targets
X_test = test_data[0::, 0::]    # test features

下一步是我在上面输入的GridSearchCV代码...

更新2：链接到数据

以下是datasets

的链接

Answer 1

错误是由intercept_scaling=0引起的。看起来像是scikit-learn中的一个错误。

sklearn GridSearchCV：ValueError：X每个样本有21个特征;期待19

1 个答案: