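I built a logistic regression model using the pipeline described by Databricks: https://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
The features (both numeric and string features) were encoded with OneHotEncoderEstimator and then transformed with a standard scaler.
I would like to know how to map the weights (coefficients) obtained from the logistic regression back to the feature names in the original DataFrame.
In other words, how do I find the feature that corresponds to each weight or coefficient returned by the model?
Thank you.
I tried to extract the features from lrModel.schema, which gives a list of StructField entries showing the features.
I then tried to map the features extracted from the schema to the weights, but was not successful.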
from pyspark.ml.classification import LogisticRegression
# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="scaledFeatures", maxIter=10)
# Train model with Training Data
lrModel = lr.fit(trainingData)
predictions = lrModel.transform(trainingData)
LRschema = predictions.schema
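Expected result: a list of tuples (feature weight, feature name).
Answer 0 (score: 0)
This is not a direct output of LogisticRegression, but it can be obtained with the following function, which I use: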
def ExtractFeatureCoeficient(model, dataset, excludedCols=None):
    # Assumes every remaining column contributes exactly one coefficient
    test = model.transform(dataset)
    weights = model.coefficients
    print('This is model weights: \n', weights)
    weights = [(float(w),) for w in weights]  # convert numpy type to float, and to tuple
    if excludedCols is None:
        excludedCols = ['y', 'classWeights', 'features', 'label', 'rawPrediction', 'probability', 'prediction']
    feature_col = [f for f in test.schema.names if f not in excludedCols]
    if len(weights) != len(feature_col):
        # Fail loudly instead of returning an undefined weightsDF
        raise ValueError('Coefficients do not match the remaining features in the model, please check field lists with model.transform(dataset).schema.names')
    # sqlContext is predefined in Databricks notebooks
    weightsDF = sqlContext.createDataFrame(list(zip(weights, feature_col)), schema=["Coeficients", "FeatureName"])
    return weightsDF

results = ExtractFeatureCoeficient(lrModel, trainingData)
results.show()
This produces a Spark DataFrame with the following fields:
+--------------------+--------------------+
| Coeficients| FeatureName|
+--------------------+--------------------+
|[0.15834847825223...| name |
| [0.0]| lat |
+--------------------+--------------------+
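Note that the function above compares the coefficients with the DataFrame's column names, so it only lines up when each column contributes exactly one feature. After one-hot encoding, a single column expands into several vector slots. In that case the slot names are usually recorded in the ML attribute metadata of the assembled vector column. The following is a minimal sketch of that mapping, not a definitive implementation: map_coefficients and the example metadata are hypothetical, and the dictionary layout ("attrs" grouped by kind, each entry with "idx" and "name") is assumed to match what VectorAssembler writes.

```python
def map_coefficients(ml_attr, coefficients):
    """Pair each coefficient with the feature name stored in vector metadata.

    ml_attr: dict found under metadata["ml_attr"] of the assembled features
             column (assumed layout: {"attrs": {kind: [{"idx", "name"}, ...]}}).
    coefficients: sequence of weights, e.g. lrModel.coefficients.toArray().
    """
    # Flatten the per-kind attribute lists ("numeric", "binary", "nominal", ...)
    attrs = [a for group in ml_attr["attrs"].values() for a in group]
    # "idx" is the position of the feature inside the vector
    attrs.sort(key=lambda a: a["idx"])
    if len(attrs) != len(coefficients):
        raise ValueError("metadata lists %d features but the model has %d coefficients"
                         % (len(attrs), len(coefficients)))
    return [(float(coefficients[a["idx"]]), a["name"]) for a in attrs]


# Hypothetical metadata and weights, for illustration only
example_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"},
                    {"idx": 3, "name": "hours_per_week"}],
        "binary": [{"idx": 1, "name": "workclassclassVec_Private"},
                   {"idx": 2, "name": "workclassclassVec_Self-emp"}],
    },
    "num_attrs": 4,
}
example_weights = [0.5, -1.25, 0.0, 2.0]

pairs = map_coefficients(example_attr, example_weights)
for weight, name in pairs:
    print(name, weight)
```

On a real DataFrame the inputs would come from something like predictions.schema["features"].metadata["ml_attr"] and lrModel.coefficients.toArray(), assuming the assembler's output column is named "features". The scaled column may not carry this metadata, so it is safer to read it from the column that feeds the scaler; scaling does not change the order of the slots.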
Alternatively, you can fit a GLM as follows:
from pyspark.ml.regression import GeneralizedLinearRegression
glm = GeneralizedLinearRegression(family="binomial", link="logit", featuresCol="features", labelCol="label", maxIter=1000, regParam=0.8, weightCol="classWeights")
# Train the model.
model = glm.fit(trainingData)
# Then get the summary of the fitted model:
summary = model.summary
print(summary)
This produces the output:
Coefficients:
    Feature Estimate Std Error  T Value P Value
(Intercept)  -1.3079    0.0705 -18.5549  0.0000
       name   0.1248    0.0158   7.9129  0.0000
        lat   0.0239    0.0209   1.1455  0.2520
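If you need these numbers as data rather than printed text, the same table can be rebuilt from the individual statistics. The sketch below is only an illustration under stated assumptions: coefficient_table is a hypothetical helper, and it assumes the standard errors, t values and p values each carry the intercept's statistic as their last element, which is how Spark documents them when fitIntercept is true. The sample numbers are the ones from the output above.

```python
def coefficient_table(feature_names, coefficients, intercept,
                      std_errors, t_values, p_values):
    """Build (feature, estimate, std error, t value, p value) rows.

    std_errors, t_values and p_values are assumed to hold one entry per
    feature followed by the intercept's entry as the last element.
    """
    def intercept_first(values):
        # Move the trailing intercept statistic to the front
        values = list(values)
        return values[-1:] + values[:-1]

    names = ["(Intercept)"] + list(feature_names)
    estimates = [intercept] + list(coefficients)
    if len(std_errors) != len(names):
        raise ValueError("expected one statistic per feature plus the intercept")
    return list(zip(names,
                    estimates,
                    intercept_first(std_errors),
                    intercept_first(t_values),
                    intercept_first(p_values)))


# Values taken from the summary output above
rows = coefficient_table(
    feature_names=["name", "lat"],
    coefficients=[0.1248, 0.0239],
    intercept=-1.3079,
    std_errors=[0.0158, 0.0209, 0.0705],
    t_values=[7.9129, 1.1455, -18.5549],
    p_values=[0.0, 0.2520, 0.0],
)
for row in rows:
    print(row)
```

With a fitted model, the inputs would presumably be model.coefficients, model.intercept, summary.coefficientStandardErrors, summary.tValues and summary.pValues; the feature names still have to come from your own column list or from the vector metadata.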