How to do regression with categorical and non-categorical features

Asked: 2020-08-17 13:56:51

Tags: python machine-learning scikit-learn

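If I have multiple features, some of which are categorical and some of which are not, what is the correct way to do regression with sklearn?

I am trying ColumnTransformer, but I am not sure I am doing it right: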

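from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer, OneHotEncoder

# df is the DataFrame holding the data (its loading is not shown in the question)
features = df[['grad', 'oblast', 'tip',
               'parcela', 'bruto', 'neto', 'osnova',
               'neto/bruto', 'zauzetost', 'sipovi', 'garaza',
               'nadzemno', 'podzemno', 'tavanica', 'fasada']]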


results = df[['ukupno gradjevinski din']]


trans = ColumnTransformer(transformers=[('onehot', OneHotEncoder(), ['grad', 'oblast', 'tip', 'garaza', 'tavanica', 'fasada']),
                                        ('normalizer', Normalizer(), ['parcela', 'bruto', 'neto', 'osnova', 'neto/bruto', 'zauzetost', 'nadzemno'])],
                          remainder='passthrough') # Default is to drop untransformed columns

features = trans.fit_transform(features)

When I print corr() for some of the features, I see that they are strongly correlated with the result:

print(df[['parcela', 'bruto', 'neto', 'osnova', 'ukupno gradjevinski din']].corr().to_string())

                          parcela     bruto      neto    osnova  ukupno gradjevinski din
parcela                  1.000000  0.929939  0.930039  0.987574                 0.911690
bruto                    0.929939  1.000000  0.998390  0.943996                 0.878914
neto                     0.930039  0.998390  1.000000  0.946102                 0.889850
osnova                   0.987574  0.943996  0.946102  1.000000                 0.937064
ukupno gradjevinski din  0.911690  0.878914  0.889850  0.937064                 1.000000

The problem is that I stacked 7-8 regression models and evaluated them with cross-validation, but I get scores ranging from -10 to -80, which does not look normal to me.

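import numpy as np
from sklearn.ensemble import BaggingRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

regressors = [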
              ["Bagging Regressor TREE", BaggingRegressor(base_estimator = DecisionTreeRegressor(max_depth=15))],
              ["Bagging Regressor FOREST", BaggingRegressor(base_estimator = RandomForestRegressor(n_estimators = 100))],
              ["Bagging Regressor linear", BaggingRegressor(base_estimator = LinearRegression(normalize=True))],
              ["Bagging Regressor lasso", BaggingRegressor(base_estimator = Lasso(normalize=True))],
              ["Bagging Regressor SVR rbf", BaggingRegressor(base_estimator = SVR(kernel = 'rbf', C=10.0, gamma='scale'))],
              ["Extra Trees Regressor", ExtraTreesRegressor(n_estimators = 150)],
              ["K-Neighbors Regressor", KNeighborsRegressor(n_neighbors=1)]]


# Evaluate each model with 5-fold cross-validation and print its mean R^2
for reg in regressors:
    scores = cross_val_score(reg[1], features, results, cv=5, scoring='r2')
    scores = np.average(scores)
    print(reg[0], scores)

Whenever it gets to "Bagging Regressor linear", I get an error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Even if I run the regression models using only the features you see in corr(), I get the same results.

Can you tell me more about the problem I am having?

1 Answer:

Answer 0 (score: 0)

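One way to combine categorical and non-categorical features in a regression model is to use one-hot encoding on the categorical features. Specifically, if you have a categorical feature with 3 possible values, you create 3 columns and fill them with 0s and 1s according to the one-hot encoded value.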
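
As a rough illustration of that idea, here is a minimal sketch with scikit-learn. The toy data, the column names `material` and `area`, and the target `price` are made up for this example and are not from the question. The categorical column has 3 possible values, so it becomes 3 columns of 0s and 1s, while the numeric column is passed through unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column with 3 possible values
# and one numeric column. All names and numbers are invented.
toy = pd.DataFrame({
    "material": ["brick", "wood", "stone", "wood", "brick", "stone"],
    "area": [50.0, 80.0, 65.0, 120.0, 95.0, 70.0],
})
price = np.array([100.0, 150.0, 140.0, 230.0, 190.0, 150.0])

# One-hot encode the categorical column; pass the numeric column through.
encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["material"])],
    remainder="passthrough",
)

encoded = encoder.fit_transform(toy)
# The result may be a sparse matrix; convert to a dense array for inspection.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)
print(encoded)  # 3 one-hot columns followed by the numeric column

# Chain the encoding and a regressor so both are fitted together.
model = Pipeline(steps=[("encode", encoder), ("regress", LinearRegression())])
model.fit(toy, price)
print(model.predict(toy.head(2)))
```

Putting the encoder inside a Pipeline means that, under cross-validation, the encoding is fitted on each training fold only. That is a design choice of this sketch rather than something the answer prescribes.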

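You can find details with clear explanations, examples, and implementations on page 213 of the book Introduction to Machine Learning with Python: A Guide for Data Scientists, in the section "One-Hot-Encoding (Dummy Variables)".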