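I have been using e1071::svm(..., probability = TRUE) in R to fit binary SVM classifiers, and then predict.svm() to get probabilities for the training and test samples. When I convert the probabilities to log(odds) and plot them against the decision values, I see a discontinuity in the predictions: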
Plot of log(odds) = log(prob/(1-prob)) vs. Decision Values
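This happens with other models too, whenever the probability drops below 0.25%; there is always a gap from log(odds) = -5.98 to -10.86. Note that it does not occur at a fixed decision.value (that varies from model to model). I believe it can also happen at high probabilities (> 99%).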
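As a quick sanity check on those numbers, the gap endpoints can be converted between probability and log(odds) with base R's qlogis() and plogis(). This is only arithmetic on the values quoted above and does not explain the gap; the variable names are mine.

```r
# Convert between probability and log(odds) using base R (stats package).
p_edge  <- 0.0025               # the 0.25% threshold mentioned above
lo_edge <- qlogis(p_edge)       # log(p / (1 - p)), roughly -5.99
p_far   <- plogis(-10.86)       # probability on the far side of the gap, roughly 1.9e-05
hi_edge <- qlogis(1 - p_edge)   # mirror-image endpoint on the high-probability side
print(c(lo_edge = lo_edge, p_far = p_far, hi_edge = hi_edge))
```

So the near edge of the gap matches the 0.25% threshold, and its mirror image on the positive side corresponds to a probability of 99.75%.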
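The red and green lines are linear fits to the predictions with log(odds) < -8 and > -8, respectively. The coefficients of the latter agree with the probA and probB values returned in the svm object. I have seen other cases where the gap runs from +5.98 to +10.86 (only).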
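To spell out why I expected a single straight line: the e1071 documentation describes probA and probB as the parameters of a logistic of the form 1 / (1 + exp(a x + b)) fitted to the decision values. Assuming that form is applied directly to each decision value, log(odds) would be exactly linear in the decision value, with slope -probA and intercept -probB (up to the sign convention for which class counts as positive). A minimal sketch with made-up values of A and B, not taken from any fitted model; the helper name platt_logodds is hypothetical:

```r
# Hypothetical helper: log(odds) implied by a logistic 1 / (1 + exp(A * dv + B)).
platt_logodds <- function(dv, A, B) {
  p <- 1 / (1 + exp(A * dv + B))   # probability under the assumed logistic form
  log(p / (1 - p))                 # algebraically equal to -(A * dv + B)
}

A  <- -2.5                  # made-up slope parameter, for illustration only
B  <- 0.3                   # made-up offset parameter, for illustration only
dv <- seq(-3, 3, by = 0.5)  # made-up decision values

# The implied log(odds) is a straight line in dv with slope -A and intercept -B.
implied <- platt_logodds(dv, A, B)
print(coef(lm(implied ~ dv)))
```

Under that assumption there is no room for a jump, which is why the second (red) line segment surprises me.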
Here is an example using the iris dataset:
require("datasets")
require("e1071")
iris$is.setosa <- as.numeric(iris$Species=="setosa")
set.seed(8675309)
fit <- svm(
  is.setosa ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
  data = iris, probability = TRUE, cost = 0.01, kernel = "linear", type = "C-classification")
preds <- predict(fit, newdata = iris, probability = TRUE, decision.values = TRUE)
DVs <- attr(preds,"decision.values")[,1]
probs <- attr(preds,"probabilities")[,"1"]
logodds <- log(probs/(1-probs))
plot(DVs,logodds,xlab="decision.values",ylab="log(odds)",main="IRIS dataset")
cat("Coefficients of probability model reported by svm():\n")
print(fit[c("probA","probB")])
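# Separate linear fits on either side of the gap (named so the svm fit is not overwritten)
lm_hi <- lm(logodds ~ DVs, subset = which(logodds > -8))
cat("fit of logodds ~ DVs when log(odds) greater than -8:\n")
print(summary(lm_hi))
abline(lm_hi, col = "green", lty = 3)
lm_lo <- lm(logodds ~ DVs, subset = which(logodds < -8))
abline(lm_lo, col = "red", lty = 3)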
Has anyone else seen this behavior? Any idea what might be causing it? Thanks!