Question

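I'm trying to use `train()` with a random forest for the Practical Machine Learning course, but I've run into two problems. Because the original dataset is very large, I've reproduced the problem with two small data frames, shown below.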

Input

```
library(caret)
f = data.frame(x = 1:10, y = 11:20)
f2 = data.frame(x = 1:5, y = 6:10)
fit <- train(y~., data = f, method="lm")
pred <- predict(fit, newdata = f2)
confusionMatrix(pred, f2)
```

Output (the main problem)

```
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
```

If I use the `table` function instead of `confusionMatrix`, I get the following:

```
Error in table(pred, data = f2) : all arguments must have the same length
```

even though `pred` has length 5 and `f2$y` has length 5.

As a side note, the `train()` call in this example also occasionally gives me a warning that I don't understand either.

```
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
There were missing values in resampled performance measures.
```

Edit: grammar

Answer 1

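I think you have three issues.

`confusionMatrix` expects two vectors, but `f2` is a data frame. Use `confusionMatrix(pred, f2$y)` instead.
But this produces a different error: `The data must contain some levels that overlap the reference.` That points to the second issue. If you look at the predicted and actual values for `f2`, they don't overlap at all. Essentially, `f` and `f2` represent completely different relationships between `x` and `y`. You can see this by plotting the data.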
```
library(tidyverse)
theme_set(theme_classic())

ggplot(bind_rows(f=f,f2=f2, .id="source"), aes(x,y,colour=source)) +
  geom_point() +
  geom_smooth(method="lm") 
```
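The first two issues can also be checked numerically. The sketch below is not from the original answer: it uses base R's `lm()` as a stand-in for `train(..., method = "lm")`, and the names `fit_lm`, `tab`, and `overlap` are only illustrative. It shows that `f2` is a list of two columns rather than an atomic vector of length 5, which is consistent with both the `sort.list` error and the `table` length error from the question, and that the predicted values never coincide with the actual ones.

```r
# Toy data from the question
f  <- data.frame(x = 1:10, y = 11:20)
f2 <- data.frame(x = 1:5,  y = 6:10)

# Assumption: a plain lm() fit stands in for the caret model
fit_lm <- lm(y ~ x, data = f)
pred   <- unname(predict(fit_lm, newdata = f2))

is.atomic(f2)   # FALSE: a data frame is a list of columns
length(f2)      # 2, the number of columns, not the number of rows
length(pred)    # 5

# Passing the column instead of the whole data frame gives equal lengths
tab <- table(pred = round(pred), actual = f2$y)

# Predictions are 11..15 while the actual values are 6..10
overlap <- intersect(round(pred), f2$y)
length(overlap) # 0
```

So `table(pred, f2$y)` runs, but the resulting table has no agreement between the predicted and actual values to measure.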
Also, there is no noise in the fake data, so the fit is perfect (RMSE is effectively 0 and R-squared = 1).
```
fit
```
```
Resampling results:

  RMSE          Rsquared
  1.650006e-15  1
```
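Third, the fake dataset has a continuous outcome variable. A confusion matrix, however, is a tool for checking the quality of a classification model, that is, a model whose outcome is categorical rather than continuous. In that case you would use a model suited to classification, such as logistic regression or a random forest, rather than a linear regression model. You would then use `confusionMatrix` to compare the predicted classes with the actual classes.

Here is an example: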

```
library(caret)

# Fake data; y is stored as a factor explicitly
# (since R 4.0, data.frame() no longer converts strings to factors)
set.seed(100)
f = data.frame(y = factor(c(rep(c("A","B"), c(100,25)), rep(c("B","A"), c(100,25)))),
               x = c(rnorm(125, 1, 1), rnorm(125, 3, 1)))

# Train model on training data
set.seed(50)
idx = sample(1:nrow(f), 200)  # Indices of training observations
fit <- train(y ~ ., data = f[idx,], method="glm")

# Get predictions on probability scale
pred <- predict(fit, newdata=f[-idx, ], type="prob")

# Create data frame for confusion matrix; recent caret versions expect
# both arguments to be factors with the same levels
results = data.frame(pred=factor(ifelse(pred$A < 0.5, "B","A"), levels=levels(f$y)),
                     actual=f$y[-idx])

confusionMatrix(results$pred, results$actual)
```

```
Confusion Matrix and Statistics

          Reference
Prediction  A  B
         A 16  7
         B  6 21

               Accuracy : 0.74            
                 95% CI : (0.5966, 0.8537)
    No Information Rate : 0.56            
    P-Value [Acc > NIR] : 0.006698        

                  Kappa : 0.475           
 Mcnemar's Test P-Value : 1.000000        

            Sensitivity : 0.7273          
            Specificity : 0.7500          
         Pos Pred Value : 0.6957          
         Neg Pred Value : 0.7778          
             Prevalence : 0.4400          
         Detection Rate : 0.3200          
   Detection Prevalence : 0.4600          
      Balanced Accuracy : 0.7386          

       'Positive' Class : A
```
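For the continuous outcome in the original question, regression error measures take the place of a confusion matrix. The following is a minimal sketch on the question's toy data, again assuming base R's `lm()` as a stand-in for the caret fit. The names `fit_lm`, `rmse`, and `mae` are illustrative and do not come from the original post.

```r
# Toy data from the question
f  <- data.frame(x = 1:10, y = 11:20)
f2 <- data.frame(x = 1:5,  y = 6:10)

# Assumption: a plain lm() fit stands in for train(..., method = "lm")
fit_lm <- lm(y ~ x, data = f)
pred   <- predict(fit_lm, newdata = f2)

# Every prediction is off by 5 (11..15 predicted vs 6..10 actual)
rmse <- sqrt(mean((pred - f2$y)^2))  # root mean squared error
mae  <- mean(abs(pred - f2$y))       # mean absolute error
```

With caret loaded, `postResample(pred, f2$y)` should report comparable measures for a numeric outcome.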
