在尝试评估决策树回归模型时测试分数 NaN

时间:2021-06-07 12:46:04

标签: python decision-tree scikit-learn-pipeline

我正在尝试使用来自艾姆斯住房数据集的数值和分类特征来评估决策树模型的准确性。对于数值特征的预处理,我使用了 SimpleImputer 和 StandardScalar。至于分类特征,我使用了 one hot encoder。我尝试使用 10 折交叉验证来评估决策树模型(决策树回归器),但我得到了测试分数的 Nan 值。这是我的代码:

import pandas as pd
ames_housing = pd.read_csv("../datasets/house_prices.csv", na_values="?")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]

numerical_features = [
"LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
"BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
"GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
"GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
"3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",]

data_numerical = data[numerical_features]

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

categorical_columns = selector(dtype_include=object)(data)
numerical_columns = selector(dtype_exclude=object)(data)

preprocessor = make_column_transformer(
(OneHotEncoder(handle_unknown="ignore"), categorical_columns),
(StandardScaler(), SimpleImputer(), numerical_columns),
)

model = make_pipeline(preprocessor, DecisionTreeRegressor())

cv_results = cross_validate(
model, data, target, cv=10, return_estimator=True, n_jobs=2,
)

scores = cv_results["test_score"]
print(f"Accuracy score by cross-validation "
  f"search:\n{scores.mean():.3f} +/- {scores.std():.3f}")

这是我得到的考试成绩:

Accuracy score by cross-validation search:
nan +/- nan

为了找出问题的根源,我在交叉验证中传递了 (error_score='raise') 作为参数。结果,发现错误是:

 ValueError: No valid specification of the columns. Only a scalar, list or slice of all integers 
  or all strings, or boolean mask is allowed

我该如何解决这个问题?任何帮助都感激不尽。谢谢:)

这是我的模型的样子:

Pipeline(steps=[('columntransformer',
             ColumnTransformer(transformers=[('onehotencoder',
                                              OneHotEncoder(handle_unknown='ignore'),
                                              ['MSZoning', 'Street',
                                               'Alley', 'LotShape',
                                               'LandContour', 'Utilities',
                                               'LotConfig', 'LandSlope',
                                               'Neighborhood', 'Condition1',
                                               'Condition2', 'BldgType',
                                               'HouseStyle', 'RoofStyle',
                                               'RoofMatl', 'Exterior1st',
                                               'Exterior2nd', 'MasVnrType',
                                               'ExterQual', 'ExterCond',
                                               'Foundation', 'BsmtQual',
                                               'BsmtCond', 'BsmtExposure',
                                               'BsmtFinType1',
                                               'BsmtFinType2', 'Heating',
                                               'HeatingQC', 'CentralAir',
                                               'Electrical', ...]),
                                             ('standardscaler',
                                              StandardScaler(),
                                              SimpleImputer())])),
            ('decisiontreeregressor', DecisionTreeRegressor())])

数据:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     1452 non-null   object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
dtypes: float64(3), int64(34), object(43)
memory usage: 912.6+ KB

目标:

0       208500
1       181500
2       223500
3       140000
4       250000
         ...  
1455    175000
1456    210000
1457    266500
1458    142125
1459    147500
Name: SalePrice, Length: 1460, dtype: int64

1 个答案:

答案 0 :(得分:0)

如果你的一个转换器有 1 个以上的估计器,在这种情况下,对于数字列,你有 StandardScaler(), SimpleImputer() ,你需要用管道包装它,例如:

np.random.seed(111)
data = pd.DataFrame(np.random.uniform(0,1,(100,3)),columns=['n1','n2','n3'])
data['c1'] = np.random.choice(['A','B',],100)
target = np.random.normal(0,1,100)

cat_columns = selector(dtype_include=object)(data)
num_columns = selector(dtype_exclude=object)(data)

num_transformer = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('imputer', SimpleImputer())
    ])

preprocessor = make_column_transformer(
(OneHotEncoder(handle_unknown="ignore"), cat_columns),
(num_transformer, num_columns),
)

只需在数据集上进行测试,它就可以工作:

preprocessor.fit_transform(data)[:2]

array([[ 0.        ,  1.        ,  0.42472149, -1.23160187, -0.1782728 ],
       [ 0.        ,  1.        ,  0.95749076, -0.79751471, -1.12825996]])

然后运行一切:

model = make_pipeline(preprocessor, DecisionTreeRegressor())

cv_results = cross_validate(
model, data, target, cv=10, return_estimator=True, n_jobs=2,
)

scores = cv_results["test_score"]

array([-2.45981423, -7.88563769, -1.15523361, -0.56772717, -0.84663734,
       -0.61938564, -3.1854688 , -1.44865232, -0.41933732, -3.13719368])