Let's start with a simple linear regression output (copied from here),
Call:
lm(formula = a1 ~ ., data = clean.algae[, 1:12])
Residuals:
    Min      1Q  Median      3Q     Max
-37.679 -11.893  -2.567   7.410  62.190

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  42.942055  24.010879   1.788  0.07537 .
seasonspring  3.726978   4.137741   0.901  0.36892
seasonsummer  0.747597   4.020711   0.186  0.85270
seasonwinter  3.692955   3.865391   0.955  0.34065
sizemedium    3.263728   3.802051   0.858  0.39179
sizesmall     9.682140   4.179971   2.316  0.02166 *
speedlow      3.922084   4.706315   0.833  0.40573
speedmedium   0.246764   3.241874   0.076  0.93941
mxPH         -3.589118   2.703528  -1.328  0.18598
mnO2          1.052636   0.705018   1.493  0.13715
Cl           -0.040172   0.033661  -1.193  0.23426
NO3          -1.511235   0.551339  -2.741  0.00674 **
NH4           0.001634   0.001003   1.628  0.10516
oPO4         -0.005435   0.039884  -0.136  0.89177
PO4          -0.052241   0.030755  -1.699  0.09109 .
Chla         -0.088022   0.079998  -1.100  0.27265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.65 on 182 degrees of freedom
Multiple R-squared: 0.3731, Adjusted R-squared: 0.3215
F-statistic: 7.223 on 15 and 182 DF, p-value: 2.444e-12
From this output we can see how well the model fits and which variables have a significant effect on the target variable. We can also tell whether a variable's effect is positive or negative from the sign of its coefficient.
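For instance, the directions can be read off programmatically, as in this minimal sketch (assuming clean.algae is prepared as in the DMwR package examples):

fit <- lm(a1 ~ ., data = clean.algae[, 1:12])
coef(fit)                  # estimated coefficients; the sign gives the direction
sign(coef(fit))            # +1 = positive effect, -1 = negative effect
summary(fit)$coefficients  # full table with standard errors and p-values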
Now consider this example from the mlr package manual,
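(The snippet below assumes a resampling description rdesc and a feature-selection control object ctrl were defined earlier in the manual; a plausible, purely illustrative setup would be:)

library(mlr)
rdesc = makeResampleDesc("CV", iters = 10)    # cross-validated evaluation
ctrl = makeFeatSelControlRandom(maxit = 20L)  # random search over feature subsets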
### Select features
sfeats = selectFeatures(learner = "surv.coxph", task = wpbc.task, resampling = rdesc,
control = ctrl, show.info = FALSE)
sfeats
## FeatSel result:
## Features (14): mean_radius, mean_compactness, mean_concavepoints, mean_symmetry, mean_fractaldim, SE_perimeter, SE_area, SE_concavity, SE_fractaldim, worst_radius, worst_perimeter, worst_concavity, worst_concavepoints, tsize
## cindex.test.mean=0.6718346
From the output above, I can see the list of important features. My question is: how can I tell the direction (positive or negative) in which a feature (independent variable) affects the target variable? Can anyone help with this? Recommended reading material would be much appreciated.
Update
I am trying to implement your suggestion on the multilabel classification yeast.task example,
library(mlr)
library(mmpf)
yeast <- getTaskData(yeast.task)
labels <- colnames(yeast)[1:14]
yeast.task <- makeMultilabelTask(id = "multi", data = yeast, target = labels)
lrn.br <- makeLearner("classif.rpart", predict.type = "prob")
lrn.br <- makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod <- mlr::train(lrn.br, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))
pred <- predict(mod, newdata = yeast[1501:1600,])
performance(pred, measures = list(multilabel.subset01, multilabel.hamloss, multilabel.acc,
multilabel.f1, timepredict))
rdesc <- makeResampleDesc(method = "CV", stratify = FALSE, iters = 3)
r <- resample(learner = lrn.br, task = yeast.task, resampling = rdesc, show.info = FALSE)
getMultilabelBinaryPerformances(pred, measures = list(acc, mmce, auc))
getMultilabelBinaryPerformances(r$pred, measures = list(acc, mmce))
getLearnerModel(mod)
pd <- generatePartialDependenceData(mod, yeast.task)
plotPartialDependence(pd)
The last three lines gave me the following output. I am not sure whether any of this is useful. Any idea what I am doing wrong?
> getLearnerModel(mod)
$label1
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label1; obs = 1500; features = 103
Hyperparameters: xval=0
$label2
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label2; obs = 1500; features = 103
Hyperparameters: xval=0
$label3
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label3; obs = 1500; features = 103
Hyperparameters: xval=0
$label4
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label4; obs = 1500; features = 103
Hyperparameters: xval=0
$label5
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label5; obs = 1500; features = 103
Hyperparameters: xval=0
$label6
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label6; obs = 1500; features = 103
Hyperparameters: xval=0
$label7
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label7; obs = 1500; features = 103
Hyperparameters: xval=0
$label8
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label8; obs = 1500; features = 103
Hyperparameters: xval=0
$label9
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label9; obs = 1500; features = 103
Hyperparameters: xval=0
$label10
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label10; obs = 1500; features = 103
Hyperparameters: xval=0
$label11
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label11; obs = 1500; features = 103
Hyperparameters: xval=0
$label12
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label12; obs = 1500; features = 103
Hyperparameters: xval=0
$label13
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label13; obs = 1500; features = 103
Hyperparameters: xval=0
$label14
Model for learner.id=classif.rpart; learner.class=classif.rpart
Trained on: task.id = label14; obs = 1500; features = 103
Hyperparameters: xval=0
>
> pd <- generatePartialDependenceData(mod, yeast.task)
Error in data.table(preds, design[, vars, drop = FALSE], key = vars) :
column or argument 1 is NULL
> plotPartialDependence(pd)
Error in checkClass(x, classes, ordered, null.ok) : object 'pd' not found
Answer 0 (score: 2)
If you want to extract the coefficients of the underlying learner's model, you have to use getLearnerModel() in mlr:
library(mlr)
mod = train(learner = "surv.coxph", task = lung.task)
getLearnerModel(mod)
Output:
Call:
survival::coxph(formula = f, data = data)
               coef  exp(coef)  se(coef)     z       p
inst      -3.04e-02   9.70e-01  1.31e-02 -2.31 0.02062
age        1.28e-02   1.01e+00  1.19e-02  1.07 0.28340
sex       -5.67e-01   5.67e-01  2.01e-01 -2.81 0.00489
ph.ecog    9.07e-01   2.48e+00  2.39e-01  3.80 0.00014
ph.karno   2.66e-02   1.03e+00  1.16e-02  2.29 0.02223
pat.karno -1.09e-02   9.89e-01  8.14e-03 -1.34 0.18016
meal.cal   2.60e-06   1.00e+00  2.68e-04  0.01 0.99224
wt.loss   -1.67e-02   9.83e-01  7.91e-03 -2.11 0.03465
Likelihood ratio test=33.7 on 8 df, p=5e-05
n= 167, number of events= 120
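To get the direction of each effect programmatically, you can work with the returned coxph object directly; a minimal sketch using standard accessors:

cox.fit = getLearnerModel(mod)
coef(cox.fit)       # log hazard ratios; the sign gives the direction of the effect
exp(coef(cox.fit))  # hazard ratios: > 1 increases the hazard, < 1 decreases it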
If you are interested in something that does not depend on the particular learner, you can have a look at partial dependence plots. For coxph they are, unsurprisingly, linear:
pd = generatePartialDependenceData(mod, lung.task)
plotPartialDependence(pd)
But you can also use a random forest for survival data:
mod2 = train(learner = "surv.randomForestSRC", task = lung.task)
pd2 = generatePartialDependenceData(mod2, lung.task)
plotPartialDependence(pd2)
However, partial dependence plots must also be interpreted with care, so you should read up on them here. You can also have a look at ICE plots.
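As a sketch (assuming your mlr version supports the individual argument of generatePartialDependenceData), ICE curves can be produced with:

pd.ice = generatePartialDependenceData(mod2, lung.task, individual = TRUE)
plotPartialDependence(pd.ice)  # one curve per observation instead of the average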
Answer 1 (score: 0)
You can use the $opt.path member of the result to get more information about what happened during the feature selection process, but in general it does not make sense to assign a positive or negative correlation/effect to an individual feature. For most models you cannot define a directional correlation the way you can for linear regression, because it is not meaningful for that kind of model, and the feature selection functionality is model-agnostic.
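For instance (a sketch, assuming sfeats is the FeatSel result from the question), the optimization path can be inspected like this:

path = as.data.frame(sfeats$opt.path)
head(path)  # one row per evaluated feature subset, with the resampled performance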
Even assessing whether a particular feature improves the model's performance (a positive effect) or degrades it (a negative effect) does not make sense for this kind of selection, because it takes feature interactions into account: a feature may have a positive effect in combination with one subset of the other features and a negative effect with another. The toy example below illustrates this.
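A toy illustration (purely hypothetical data, base R): with y = x1 * x2, the apparent direction of x1's effect flips depending on the value of x2:

set.seed(1)
x1 = rnorm(1000); x2 = rnorm(1000)
y = x1 * x2
cor(y, x1)                  # roughly zero overall
cor(y[x2 > 0], x1[x2 > 0])  # clearly positive
cor(y[x2 < 0], x1[x2 < 0])  # clearly negative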
Finally, correlating a feature with the output in this way depends not only on the model but also on the feature itself; it only works for numeric features in any case.