PySpark: lasso returns non-zero coefficients for all features

Asked: 2019-03-26 16:15:05

Tags: machine-learning pyspark logistic-regression

After splitting my data into training and test sets, my training data has about 33 million records, with 77 features and a binary response. I fit a logistic regression with a lasso (L1) penalty, but the lasso returns non-zero coefficient values for every column. I explored the correlations between the features, and some pairs are correlated as strongly as 0.8 and 0.7. Given that I have many features (77) and that they are correlated, I expected the lasso to filter out at least some of them. Why is this not happening? I cannot share the data because of its size and for privacy reasons, but the code I am using is below. I am looking for suggestions on what might be going on and what to explore.

import numpy as np
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Lasso: elasticNetParam = 1 selects a pure L1 penalty
logreg = LogisticRegression(maxIter=200, featuresCol="features",
                            labelCol="label", standardization=True, elasticNetParam=1)
# Grid over the regularization strength regParam: 0.0 to 1.0 in steps of 0.1
paramGrid_logreg = ParamGridBuilder().addGrid(logreg.regParam, np.linspace(0.0, 1, 11)).build()
crossval_logreg = CrossValidator(estimator=logreg,
                                 estimatorParamMaps=paramGrid_logreg,
                                 evaluator=BinaryClassificationEvaluator(),
                                 numFolds=10)
cvModel_logreg = crossval_logreg.fit(train_df)
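
For reference, here is a minimal diagnostic sketch (assuming the logreg, paramGrid_logreg, and cvModel_logreg objects above) of how one might check which regParam value the cross-validator selected and how many coefficients the best model actually zeroed out:

import numpy as np

# Average cross-validation metric (areaUnderROC by default) for each regParam in the grid
for params, metric in zip(paramGrid_logreg, cvModel_logreg.avgMetrics):
    print(params[logreg.regParam], metric)

# BinaryClassificationEvaluator is larger-is-better, so the selected map is the argmax
best_idx = int(np.argmax(cvModel_logreg.avgMetrics))
print("selected regParam:", paramGrid_logreg[best_idx][logreg.regParam])

# Count the coefficients that the best model left at exactly zero vs. non-zero
coefs = cvModel_logreg.bestModel.coefficients.toArray()
print("non-zero coefficients:", int(np.count_nonzero(coefs)), "of", coefs.size)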

The coefficients now come out as follows:

cvModel_logreg.bestModel.coefficients
  

DenseVector([0.0022, -0.0216, -0.0261, -0.0018, 0.0014, -0.0012, -0.0006, 0.0023, -0.0024, -0.0003, -0.0114, 0.0003, 0.0, -0.0018, -0.0163, -0.0009, -0.0022, 0.0009, -0.0017, 0.0005, -0.0024, 0.0006, -0.0025, 0.0015, -0.0025, 0.0007, -0.002, 0.0002, 0.0007, 0.0002, -0.0025, 0.0008, -0.0012, -0.0001, -0.0017, 0.0002, 0.0026, -0.0002, -0.0019, 0.0003, -0.0017, 0.0005, -0.0019, 0.0008, -0.0023, 0.0008, -0.0021, -0.0003, -0.0021, 0.0011, -0.0019, 0.0003, -0.0018, -0.0006, 0.0001, -0.0009, -0.3818, -0.0425, -0.0065, 0.0014, 0.1304, 0.0003, -0.0698, 0.0002, 0.0005, 0.0014, 0.0161, -0.0005, 0.0099, 0.0003, 0.051, -0.0006, -0.0001, -0.0005, 0.3291, -0.0056, 0.0451])
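
The correlation exploration mentioned in the question is not shown; a minimal sketch of one way to do it on the assembled features column (assuming the same train_df, with 0.7 as an illustrative threshold) could look like this:

from pyspark.ml.stat import Correlation

# Pearson correlation matrix over the assembled "features" column (77 x 77)
corr_matrix = Correlation.corr(train_df, "features").head()[0].toArray()

# Report feature index pairs whose absolute correlation exceeds the threshold
threshold = 0.7
n = corr_matrix.shape[0]
for i in range(n):
    for j in range(i + 1, n):
        if abs(corr_matrix[i, j]) > threshold:
            print(i, j, round(corr_matrix[i, j], 3))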

0 Answers:

No answers yet.