H2O: Using LASSO for feature selection in a grid search

Asked: 2019-06-27 14:16:55

Tags: python pandas h2o feature-selection grid-search

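I am working with a high-dimensional dataset and need to do some feature selection. I have already used Random Forest in H2O, and I would like to try LASSO regularization to see whether it outperforms the random forest.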

Below is the code I wrote. When I specify alpha = 0 (ridge regularization), the code works and raises no error. When I set alpha = 1 (LASSO), however, I get the error "ZeroDivisionError: float division by zero".

To set up LASSO, I followed the advice in this post: Attribute selection in h2o

Code:

import h2o
import pandas as pd
from numpy import arange
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

# feature columns and target column
x_lasso_h2o = list(feature_matrix.columns)
y_h2o = response_column

# select the values for lambda to grid over
hyper_params = {'lambda': list(arange(0.001,1,0.01))}

search_criteria_dim_reduction= {'strategy': 'RandomDiscrete',
                   'max_runtime_secs': 100,
                   'max_models': 5,
                   'stopping_metric': "auto",
                   'stopping_tolerance': 0.001,
                   'stopping_rounds': 5,
                   'seed': 1234}


# Train a random-discrete grid of GLMs with 5-fold cross-validation
glm_grid_lasso = H2OGridSearch(
    model=H2OGeneralizedLinearEstimator(family="binomial",
                                        nfolds=5,
                                        alpha=1,
                                        balance_classes=True),
    hyper_params=hyper_params,
    search_criteria=search_criteria_dim_reduction)

glm_grid_lasso.train(x=x_lasso_h2o, y=y_h2o,training_frame= train_h2o)

# Get the grid results, sorted by AUC (best model first)
glm_lasso_gridperf = glm_grid_lasso.get_grid(sort_by='auc', decreasing=True)

best_lasso = glm_lasso_gridperf.model_ids[0]
best_lasso = h2o.get_model(best_lasso)
var_imp_pd_lasso = pd.DataFrame(best_lasso.varimp(True))

The error raised is:

ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-51-ca2e05533e82> in <module>
     32 best_lasso = glm_lasso_gridperf.model_ids[0]
     33 best_lasso = h2o.get_model(best_lasso)
---> 34 var_imp_pd_lasso = pd.DataFrame(best_lasso.varimp(True))

~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\model\model_base.py in varimp(self, use_pandas)
    444                 vals = []
    445                 for item in tempvals:
--> 446                     tempT = (item[0], item[1], item[1]/maxVal, item[1]/sum)
    447                     vals.append(tempT)
    448                 header = ["variable", "relative_importance", "scaled_importance", "percentage"]

ZeroDivisionError: float division by zero
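
The arithmetic on line 446 of the traceback can be reproduced without H2O. The following is a minimal sketch, not H2O's actual code: the function names and the feature names `x1` to `x3` are hypothetical, and it assumes the importances are plain Python floats, as the error message suggests. It shows that dividing by the maximum or by the sum fails once every importance is zero, and what a guarded version could look like.

```python
def scale_importances(pairs):
    """Mimic the scaling shown in the traceback: value / max and value / sum."""
    max_val = max(v for _, v in pairs)
    total = sum(v for _, v in pairs)
    return [(name, v, v / max_val, v / total) for name, v in pairs]


def scale_importances_safe(pairs):
    """Same scaling, but return zeros instead of dividing by zero."""
    max_val = max((v for _, v in pairs), default=0.0)
    total = sum(v for _, v in pairs)
    if max_val == 0.0 or total == 0.0:
        return [(name, v, 0.0, 0.0) for name, v in pairs]
    return [(name, v, v / max_val, v / total) for name, v in pairs]


# Hypothetical importances: every coefficient was shrunk to zero.
all_zero = [("x1", 0.0), ("x2", 0.0), ("x3", 0.0)]
# Hypothetical importances: one feature survived the penalty.
one_left = [("x1", 0.5), ("x2", 0.0), ("x3", 0.0)]

try:
    scale_importances(all_zero)
    raised = False
except ZeroDivisionError:
    raised = True
```

If this is what happens here, the selected model would have no non-zero importances at all, which is plausible for the larger lambda values in the grid.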

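My thinking: since LASSO shrinks the coefficients of "unimportant" variables to 0, that could explain the division by zero. The output I expect is a list of all variables with their respective importances.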
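
One way to read selected features directly from the coefficients, rather than from the scaled importances, is sketched below. It assumes the coefficients are available as a dict mapping names to floats; to my knowledge an H2O GLM model exposes such a dict through `coef_norm()` with an "Intercept" key, but that is an assumption here and not something the traceback confirms. The helper name `nonzero_features` and the example values are hypothetical.

```python
def nonzero_features(coefs, tol=1e-12):
    """Return (name, coefficient) pairs with |coefficient| > tol, largest first."""
    kept = [
        (name, value)
        for name, value in coefs.items()
        if name != "Intercept" and abs(value) > tol
    ]
    # Sort by absolute size so the strongest effects come first.
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical standardized coefficients from a LASSO fit.
example = {"Intercept": -1.2, "age": 0.8, "income": 0.0, "visits": -1.5}
selected = nonzero_features(example)
```

An empty result would mean the penalty removed every feature, which is the case where the division in the traceback would fail.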

Thanks for your help.

Best regards.

0 answers:

No answers yet.