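I am trying to use the naiveBayes() function from the e1071 package. When I add a nonzero laplace argument, the probability estimates I get do not change, and I don't understand why.
Example: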
library(e1071)
# Generate data
train.x <- data.frame(x1=c(1,1,0,0), x2=c(1,0,1,0))
train.y <- factor(c("cat", "cat", "dog", "dog"))
test.x <- data.frame(x1=c(1), x2=c(1))
# without laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=0)
predict(classifier, test.x, type="raw") # returns (1, 0.00002507)
# with laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=1)
predict(classifier, test.x, type="raw") # returns (1, 0.00002507)
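I would expect the probabilities to change in this case, because every training instance of the "dog" class has x1 equal to 0. To check this, here is the same thing in Python.
Python example: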
import numpy as np
import pandas as pd  # needed for pd.DataFrame below
from sklearn.naive_bayes import BernoulliNB
train_x = pd.DataFrame({'x1':[1,1,0,0], 'x2':[1,0,1,0]})
train_y = np.array(["cat", "cat", "dog", "dog"])
test_x = pd.DataFrame({'x1':[1,], 'x2':[1,]})
# alpha (i.e. laplace = 0)
classifier = BernoulliNB(alpha=.00000001)
classifier.fit(X=train_x, y=train_y)
classifier.predict_proba(X=test_x) # returns (1, 0)
# alpha (i.e. laplace = 1)
classifier = BernoulliNB(alpha=1)
classifier.fit(X=train_x, y=train_y)
classifier.predict_proba(X=test_x) # returns (.75, .25)
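Why am I getting this unexpected result with e1071?
Answer 0 (score: 3)
Laplace smoothing only takes effect for categorical features, not numeric ones. You can see this in the source code: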
## estimation-function
est <- function(var)
    if (is.numeric(var)) {
        cbind(tapply(var, y, mean, na.rm = TRUE),
              tapply(var, y, sd, na.rm = TRUE))
    } else {
        tab <- table(y, var)
        (tab + laplace) / (rowSums(tab) + laplace * nlevels(var))
    }
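For numeric features, a Gaussian estimate (per-class mean and standard deviation) is used instead, so laplace never enters the calculation. Convert your data to factors and it will work.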
# explicit levels keep the train and test factors on the same coding
lv <- c("0", "1")
train.x <- data.frame(x1=factor(c("1","1","0","0"), levels=lv), x2=factor(c("1","0","1","0"), levels=lv))
train.y <- factor(c("cat", "cat", "dog", "dog"))
test.x <- data.frame(x1=factor("1", levels=lv), x2=factor("1", levels=lv))
# without laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=0)
predict(classifier, test.x, type="raw") # expected: close to 1 for cat, close to 0 for dog
# with laplace smoothing
classifier <- naiveBayes(x=train.x, y=train.y, laplace=1)
predict(classifier, test.x, type="raw") # expected: 0.75 for cat, 0.25 for dog
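The categorical branch of the estimation function quoted above can be checked by hand. The following is a minimal plain-Python sketch, not e1071's implementation: the helper names laplace_table and posterior are hypothetical, and it applies only the formula (tab + laplace) / (rowSums(tab) + laplace * nlevels(var)) to the four training rows from the question. It does not reproduce any special handling the library may apply to zero probabilities at prediction time.

```python
from collections import Counter

def laplace_table(values, labels, levels, laplace):
    """Per-class conditional probabilities, mirroring
    (tab + laplace) / (rowSums(tab) + laplace * nlevels(var))."""
    table = {}
    for cls in sorted(set(labels)):
        counts = Counter(v for v, y in zip(values, labels) if y == cls)
        total = sum(counts.values())
        table[cls] = {
            lv: (counts[lv] + laplace) / (total + laplace * len(levels))
            for lv in levels
        }
    return table

def posterior(features, labels, test_row, levels, laplace):
    """Naive Bayes posterior for one test row, normalised over the classes."""
    prior = Counter(labels)
    scores = {}
    for cls in prior:
        score = prior[cls] / len(labels)  # class prior
        for name, values in features.items():
            table = laplace_table(values, labels, levels, laplace)
            score *= table[cls][test_row[name]]
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Same toy data as in the question, with the features treated as categorical
features = {"x1": ["1", "1", "0", "0"], "x2": ["1", "0", "1", "0"]}
labels = ["cat", "cat", "dog", "dog"]
test_row = {"x1": "1", "x2": "1"}
levels = ["0", "1"]

unsmoothed = posterior(features, labels, test_row, levels, laplace=0)
smoothed = posterior(features, labels, test_row, levels, laplace=1)
print(unsmoothed)  # {'cat': 1.0, 'dog': 0.0}
print(smoothed)    # {'cat': 0.75, 'dog': 0.25}
```

With laplace=1 the hand calculation gives 0.75 for cat and 0.25 for dog, which agrees with the BernoulliNB(alpha=1) result in the question.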
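Answer 1 (score: 1)
This was a major facepalm. The naiveBayes() method was interpreting x1 and x2 as numeric variables, and was therefore trying to use a Gaussian conditional probability distribution internally (I think). Encoding the variables as factors solved my problem.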
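# explicit levels keep the train and test factors on the same coding
train.x <- data.frame(x1=factor(c(1,1,0,0), levels=c(0,1)), x2=factor(c(1,0,1,0), levels=c(0,1)))
train.y <- factor(c("cat", "cat", "dog", "dog"))
test.x <- data.frame(x1=factor(1, levels=c(0,1)), x2=factor(1, levels=c(0,1)))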
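To illustrate what the numeric path computes, here is a small plain-Python sketch of the per-class mean and standard deviation that the numeric branch of the quoted est function estimates. The helper name gaussian_estimates is hypothetical and this is not the library's code. It uses the sample standard deviation (n - 1 denominator), as R's sd does. Note that no smoothing term appears anywhere, which is consistent with the predictions not changing when laplace is set. How the library deals with a zero standard deviation at prediction time is not shown in the excerpt, so the sketch stops at the estimates.

```python
from statistics import mean, stdev

def gaussian_estimates(values, labels):
    """Per-class (mean, sd) of a numeric feature, the analogue of
    tapply(var, y, mean) and tapply(var, y, sd)."""
    out = {}
    for cls in sorted(set(labels)):
        xs = [v for v, y in zip(values, labels) if y == cls]
        out[cls] = (mean(xs), stdev(xs))  # stdev uses the n - 1 denominator
    return out

# Numeric version of the training data from the question
x1 = [1, 1, 0, 0]
x2 = [1, 0, 1, 0]
labels = ["cat", "cat", "dog", "dog"]

est_x1 = gaussian_estimates(x1, labels)
est_x2 = gaussian_estimates(x2, labels)
print(est_x1)  # x1 is constant within each class, so both sds are 0
print(est_x2)  # x2 has mean 0.5 and sd of about 0.707 in both classes
```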