How to apply a Naive Bayes model to new data

Asked: 2015-10-05 18:46:39

Tags: r naivebayes

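I asked a question about this earlier this morning, but I deleted it and am reposting here with better wording.

I built my first machine learning model using training and test data. I produced a confusion matrix and looked at some summary statistics.

I would now like to apply the model to new data to make predictions, but I don't know how.

Context: predicting monthly "churn" cancellations. The target variable is churned, which has two possible labels, "churned" and "not churned".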

    head(tdata)
  months_subscription nvk_medium                                org_type     churned
1                  25       none                               Community not churned
2                   7       none                            Sports clubs not churned
3                  28       none                            Sports clubs not churned
4                  18    unknown Religious congregations and communities not churned
5                  15       none              Association - Professional not churned
6                   9       none              Association - Professional not churned

Here is my training and testing code:

 library("klaR")
 library("caret")

# import data
test_data_imp <- read.csv("tdata.csv")

# subset only required vars
# had to remove "revenue" since all churned records are 0 (need last price point)
variables <- c("months_subscription", "nvk_medium", "org_type", "churned")
tdata <- test_data_imp[variables]

#training
rn_train <- sample(nrow(tdata),
                   floor(nrow(tdata)*0.75))
train <- tdata[rn_train,]
test <- tdata[-rn_train,]
model <- NaiveBayes(churned ~., data=train)

# testing
predictions <- predict(model, test)
# caret's confusionMatrix() takes the predictions first, then the reference labels
confusionMatrix(predictions$class, test$churned)

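So far, everything has gone well.

Now I have new data, structured and laid out the same way as tdata above. How do I apply my model to this new data to make predictions? Intuitively, I am looking for a new column, added with cbind, that holds the predicted class for each record.

I tried: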

## prediction ##
# import data
data_imp <- read.csv("pdata.csv")
pdata <- data_imp[variables]

actual_predictions <- predict(model, pdata)

#append to data and output (as head by default)
predicted_data <- cbind(pdata, actual_predictions$class)

# output
head(predicted_data)

which raised this error:

actual_predictions <- predict(model, pdata)
Error in object$tables[[v]][, nd] : subscript out of bounds
In addition: Warning messages:
1: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 1
2: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 2
3: In FUN(1:6433[[4L]], ...) :
  Numerical 0 probability for all classes with observation 3

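How do I apply my model to the new data? I want a new data frame that includes a new column with the predicted class.

**Following the comments, here are the head and str of the new data to predict on**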

head(pdata)
  months_subscription nvk_medium                                org_type     churned
1                  26       none                               Community not churned
2                   8       none                            Sports clubs not churned
3                  30       none                            Sports clubs not churned
4                  19    unknown Religious congregations and communities not churned
5                  16       none              Association - Professional not churned
6                  10       none              Association - Professional not churned
> str(pdata)
'data.frame':   6433 obs. of  4 variables:
 $ months_subscription: int  26 8 30 19 16 10 3 5 14 2 ...
 $ nvk_medium         : Factor w/ 16 levels "cloned","CommunityIcon",..: 9 9 9 16 9 9 9 3 12 9 ...
 $ org_type           : Factor w/ 21 levels "Advocacy and civic activism",..: 8 18 18 14 6 6 11 19 6 8 ...
 $ churned            : Factor w/ 1 level "not churned": 1 1 1 1 1 1 1 1 1 1 ...

1 answer:

Answer 0 (score: 1)

This is most likely caused by a mismatch between how the factors are encoded in the training data (drawn from tdata in your case) and in the new data passed to the predict function (pdata). Typically, the new data contains factor levels that are not present in the training data. You have to enforce consistent encoding of the features yourself, because the predict function will not check it. I therefore suggest that you double-check the levels of the features nvk_medium and org_type in both data frames.
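One way to run that check is sketched below in base R. The helper name unseen_levels and the toy data frames are hypothetical and only stand in for the training data and pdata from the question; the values are made up.

```r
# Hypothetical helper: for every factor column shared by both data frames,
# list the levels that occur in `new` but were never seen in `train`.
unseen_levels <- function(train, new) {
  shared <- intersect(names(train), names(new))
  is_fac <- vapply(
    shared,
    function(v) is.factor(train[[v]]) && is.factor(new[[v]]),
    logical(1)
  )
  lapply(
    setNames(shared[is_fac], shared[is_fac]),
    function(v) setdiff(levels(new[[v]]), levels(train[[v]]))
  )
}

# Toy stand-ins for the training data and the new data (made-up values)
toy_train <- data.frame(
  org_type = factor(c("Community", "Sports clubs")),
  months   = c(25, 7)
)
toy_new <- data.frame(
  org_type = factor(c("Community", "Advocacy")),
  months   = c(26, 8)
)

# Named list: one entry per shared factor column
level_report <- unseen_levels(toy_train, toy_new)
print(level_report)
```

A non-empty entry in the report points at a feature worth inspecting. Since the lookup appears to go through the integer code of each level, comparing identical(levels(a), levels(b)) for each feature is a stricter check that also catches reordered levels.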

The error message:

 Error in object$tables[[v]][, nd] : subscript out of bounds

is raised when evaluating a given feature (the v-th feature) of a data point, where nd is the integer code of that feature's factor level. You also have warnings indicating that the posterior probabilities of all classes are numerically zero for data points ("observations") 1, 2, and 3, but it is not clear whether this is also related to the encoding of the factors.
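The mechanism can be illustrated in isolation with plain base R, assuming (as the error message suggests) that prediction looks up a per-feature table by the factor's integer code. The table below is a made-up stand-in, not the actual structure inside the fitted model.

```r
# Toy conditional-probability table: rows are classes, columns are the
# four Title levels seen during training (values are made up).
cond_table <- matrix(
  c(0.2, 0.5, 0.2, 0.1,
    0.4, 0.1, 0.4, 0.1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("0", "1"), c("Miss", "Mr", "Mrs", "Nothing"))
)

# Integer code 4 has a matching column, so the lookup succeeds
ok <- cond_table[, 4]

# Integer code 5 has no matching column, so base R signals an error;
# tryCatch captures its message instead of stopping the script
msg <- tryCatch(
  cond_table[, 5],
  error = function(e) conditionMessage(e)
)
print(msg)
```

A new-data factor with one more level than the training factor can produce exactly such an out-of-range code.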

To reproduce the error that you are seeing, consider the following toy data (from http://amunategui.github.io/binary-outcome-modeling/), which has a set of features somewhat similar to that in your data:

# Data setup
# From http://amunategui.github.io/binary-outcome-modeling/
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt', sep='\t')
titanicDF$Title <- as.factor(ifelse(grepl('Mr ',titanicDF$Name),'Mr',ifelse(grepl('Mrs ',titanicDF$Name),'Mrs',ifelse(grepl('Miss',titanicDF$Name),'Miss','Nothing'))) )
titanicDF$Age[is.na(titanicDF$Age)] <- median(titanicDF$Age, na.rm = TRUE)
titanicDF$Survived <- as.factor(titanicDF$Survived)
titanicDF <- titanicDF[c('PClass', 'Age',    'Sex',   'Title', 'Survived')]

# Separate into training and test data
inds_train <- sample(1:nrow(titanicDF), round(0.5 * nrow(titanicDF)), replace = FALSE)
Data_train <- titanicDF[inds_train, , drop = FALSE]
Data_test <- titanicDF[-inds_train, , drop = FALSE]

with:

> str(Data_train)

'data.frame':   656 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 3 3 3 1 1 3 3 3 3 ...
 $ Age     : num  35 28 34 28 29 28 28 28 45 28 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 1 2 1 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 2 2 1 2 4 3 2 3 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...

> str(Data_test)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 4 levels "Miss","Mr","Mrs",..: 2 1 2 3 3 2 3 2 2 2 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

Then everything goes as expected:

model <- NaiveBayes(Survived ~ ., data = Data_train)

# This will work
pred_1 <- predict(model, Data_test)

> str(pred_1)
List of 2
 $ class    : Factor w/ 2 levels "0","1": 1 2 1 2 2 1 2 1 1 1 ...
  ..- attr(*, "names")= chr [1:657] "6" "7" "8" "9" ...
 $ posterior: num [1:657, 1:2] 0.8352 0.0216 0.8683 0.0204 0.0435 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:657] "6" "7" "8" "9" ...
  .. ..$ : chr [1:2] "0" "1"

However, if the encoding is not consistent, e.g.:

# Mess things up, by "displacing" the factor values (i.e., 'Nothing' 
# will now be encoded as number 5, which was not present in the 
# training data)
Data_test_2 <- Data_test
Data_test_2$Title <- factor(
    as.character(Data_test_2$Title), 
    levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing")
)

> str(Data_test_2)

'data.frame':   657 obs. of  5 variables:
 $ PClass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num  47 63 39 58 19 28 50 37 25 39 ...
 $ Sex     : Factor w/ 2 levels "female","male": 2 1 2 1 1 2 1 2 2 2 ...
 $ Title   : Factor w/ 5 levels "Dr","Miss","Mr",..: 3 2 3 4 4 3 4 3 3 3 ...
 $ Survived: Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 2 2 2 ...

then:

> pred_2 <- predict(model, Data_test_2)
Error in object$tables[[v]][, nd] : subscript out of bounds
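If mismatched levels do turn out to be the cause, one possible remedy is to re-encode the factors in the new data with the levels of the training data before calling predict. The following is a minimal base R sketch; align_levels and the toy data frames are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: rebuild every factor column in `new` so that it uses
# exactly the levels (and level order) of the same column in `train`.
# Values never seen in training become NA rather than a shifted code.
align_levels <- function(new, train) {
  for (v in intersect(names(new), names(train))) {
    if (is.factor(train[[v]])) {
      new[[v]] <- factor(as.character(new[[v]]), levels = levels(train[[v]]))
    }
  }
  new
}

# Toy data: the new data carries an extra leading level "Dr",
# which shifts the integer codes of all other levels by one
toy_train <- data.frame(Title = factor(c("Miss", "Mr", "Mrs", "Nothing")))
toy_new <- data.frame(
  Title = factor(c("Mr", "Nothing", "Dr"),
                 levels = c("Dr", "Miss", "Mr", "Mrs", "Nothing"))
)

aligned <- align_levels(toy_new, toy_train)
print(levels(aligned$Title))      # levels now match the training data
print(as.integer(aligned$Title))  # codes now refer to the training levels
```

With the objects from the question, this would mean something like pdata <- align_levels(pdata, train) before predict(model, pdata), and then pdata$predicted <- actual_predictions$class to get the extra column. This has not been tested against your data, and rows that end up with NA in a feature (levels never seen in training) would still need to be handled separately.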