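I am working with a high-dimensional dataset and need to do some feature selection. I have already used Random Forest in H2O, and I would like to try LASSO regularization to see whether it outperforms the random forest.
Below is the code I wrote. When I specify alpha = 0 (ridge regularization), the code works and raises no errors. However, when I set alpha = 1 (LASSO), I get the error "ZeroDivisionError: float division by zero".
To set up LASSO, I followed the advice in this post: Attribute selection in h2o
Code: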
import h2o
import numpy as np
import pandas as pd
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.grid.grid_search import H2OGridSearch

# Feature columns and target column
x_lasso_h2o = list(feature_matrix.columns)
y_h2o = response_column

# Lambda values to search over (converted to plain Python floats)
hyper_params = {'lambda': np.arange(0.001, 1, 0.01).tolist()}
search_criteria_dim_reduction = {'strategy': 'RandomDiscrete',
                                 'max_runtime_secs': 100,
                                 'max_models': 5,
                                 'stopping_metric': "auto",
                                 'stopping_tolerance': 0.001,
                                 'stopping_rounds': 5,
                                 'seed': 1234}

# Train a random (RandomDiscrete) grid of LASSO GLMs (alpha = 1) with 5-fold cross-validation
glm_grid_lasso = H2OGridSearch(
    model=H2OGeneralizedLinearEstimator(family="binomial", nfolds=5, alpha=1, balance_classes=True),
    hyper_params=hyper_params,
    search_criteria=search_criteria_dim_reduction)
glm_grid_lasso.train(x=x_lasso_h2o, y=y_h2o, training_frame=train_h2o)

# Get the grid results, sorted by AUC (best model first)
glm_lasso_gridperf = glm_grid_lasso.get_grid(sort_by='auc', decreasing=True)
best_lasso = glm_lasso_gridperf.model_ids[0]
best_lasso = h2o.get_model(best_lasso)

# This is the call that raises the ZeroDivisionError
var_imp_pd_lasso = pd.DataFrame(best_lasso.varimp(True))
The error raised is:
ZeroDivisionError Traceback (most recent call last)
<ipython-input-51-ca2e05533e82> in <module>
32 best_lasso = glm_lasso_gridperf.model_ids[0]
33 best_lasso = h2o.get_model(best_lasso)
---> 34 var_imp_pd_lasso = pd.DataFrame(best_lasso.varimp(True))
~\AppData\Local\Continuum\anaconda3\lib\site-packages\h2o\model\model_base.py in varimp(self, use_pandas)
444 vals = []
445 for item in tempvals:
--> 446 tempT = (item[0], item[1], item[1]/maxVal, item[1]/sum)
447 vals.append(tempT)
448 header = ["variable", "relative_importance", "scaled_importance", "percentage"]
ZeroDivisionError: float division by zero
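To make the failing line easier to reason about, here is a minimal sketch of the same arithmetic on plain tuples. It is not the h2o source; the function name `normalise_importances` and the sample values are hypothetical, and it only assumes what the traceback shows: each importance is divided by the maximum and by the sum.

```python
# Simplified stand-in for the normalisation step shown in the traceback.
# NOT the h2o implementation: same arithmetic, applied to plain tuples.
def normalise_importances(raw):
    """raw: list of (variable, relative_importance) pairs."""
    max_val = max(value for _, value in raw)
    total = sum(value for _, value in raw)
    # Mirrors: (item[0], item[1], item[1]/maxVal, item[1]/sum)
    return [(name, value, value / max_val, value / total) for name, value in raw]


# Hypothetical case: every importance is 0.0, so both divisors are zero.
all_zero = [("x1", 0.0), ("x2", 0.0), ("x3", 0.0)]

try:
    normalise_importances(all_zero)
    error_message = None
except ZeroDivisionError as exc:
    error_message = str(exc)

print(error_message)
```

If this sketch matches what happens inside `varimp`, the error would only occur when the maximum (or the sum) of the importances is exactly zero, not merely when some of them are.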
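My thinking: since LASSO shrinks the coefficients of "unimportant" variables to 0, that might explain the division-by-zero error. The output I expect is a list of all variables with their respective importance.
Thanks for your help.
Best regards.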
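One fallback I am considering is to rank the variables by their coefficients instead of calling `varimp`. The sketch below is only an illustration under assumptions: the helper `rank_nonzero_coefficients` and the sample dictionary are hypothetical, and I am assuming the model's `coef_norm()` returns a dict mapping variable names (plus an "Intercept" entry) to standardized coefficients.

```python
def rank_nonzero_coefficients(coefs, intercept_key="Intercept"):
    """Return (variable, coefficient) pairs with non-zero coefficients,
    sorted by absolute value, largest first. The intercept is skipped."""
    kept = [(name, value) for name, value in coefs.items()
            if name != intercept_key and value != 0.0]
    return sorted(kept, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical coefficients, shaped like the dict I assume coef_norm() returns.
example_coefs = {"Intercept": -1.2, "x1": 0.0, "x2": 0.8, "x3": -1.5}
ranked = rank_nonzero_coefficients(example_coefs)
print(ranked)
```

With a trained model this would be called as `rank_nonzero_coefficients(best_lasso.coef_norm())`. If the result is empty, the chosen lambda has probably shrunk every coefficient to zero, which would be consistent with the division by zero above.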