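Edited question:
I have a dataset with 699 rows, and the exercise I am working on asks for a training set of 300 observations. The remaining rows form the test set. I have included all the information I can to make the situation clearer.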
#First part of the code & Preprocessing
attach(Cancer_data)
names(Cancer_data)[1] <- "id"
names(Cancer_data)[2] <- "thickness"
names(Cancer_data)[3] <- "unif.size"
names(Cancer_data)[4] <- "unif.shape"
names(Cancer_data)[5] <- "adhesion"
names(Cancer_data)[6] <- "size"
names(Cancer_data)[7] <- "nuclei"
names(Cancer_data)[8] <- "chromatin"
names(Cancer_data)[9] <- "nucleoli"
names(Cancer_data)[10] <- "mitoses"
names(Cancer_data)[11] <- "Prognosis"
#Prognosis holds my class labels: 2 for benign, 4 for malignant
Prognosis <- as.factor(Cancer_data$Prognosis)
Cancer_data <- Cancer_data %>% dplyr::select(-id)
Moving straight to the rpart model, and skipping the data split since it is clear enough, I implemented this classification tree with rpart:
rpart_model <- rpart(Prognosis ~.,method = "class",data = train_set)
#The train_set was created earlier with caret::createDataPartition()
Now this is the main problem: when I predict on test_set to evaluate the tree and try to get the confusionMatrix, R returns this error:
Error: `data` and `reference` should be factors with the same levels.
Here is the code I used:
y_hat <- predict(rpart_model,test_set)
confusionMatrix(Cancer_data$Prognosis,y_hat)
I have also tried
y_hat <- predict(rpart_model,type ='class')
as suggested in a previous post.
I apologize for the length of this question, but I wanted to be as precise as possible. Thank you in advance.
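For reference, here is a minimal sketch of the pattern this error message usually points to. It runs on a hypothetical toy data frame (`toy`, with made-up values and only two predictors), not on the real Cancer_data. It assumes the likely causes are that `predict()` on a class tree returns a probability matrix unless `type = "class"` is given, and that the predictions should be compared with the labels of the same test rows rather than all 699 rows. The split with `sample()` is also an assumption, used here only to get exactly 300 training rows.

```r
library(rpart)  # recommended package, shipped with standard R installations

set.seed(42)

# Hypothetical stand-in for Cancer_data: 699 rows, labels 2 (benign) / 4 (malignant)
n <- 699
toy <- data.frame(
  thickness = sample(1:10, n, replace = TRUE),
  unif.size = sample(1:10, n, replace = TRUE)
)
# Store the label as a factor inside the data frame itself
toy$Prognosis <- factor(ifelse(toy$thickness + toy$unif.size > 11, 4, 2),
                        levels = c(2, 4))

# Exactly 300 training rows; the remaining 399 rows form the test set
train_idx <- sample(seq_len(n), size = 300)
train_set <- toy[train_idx, ]
test_set  <- toy[-train_idx, ]

rpart_model <- rpart(Prognosis ~ ., method = "class", data = train_set)

# type = "class" returns a factor of predicted labels;
# newdata = test_set makes the predictions refer to the test rows
y_hat <- predict(rpart_model, newdata = test_set, type = "class")

# Base-R confusion table (rows: predicted, columns: observed)
conf_table <- table(predicted = y_hat, observed = test_set$Prognosis)
print(conf_table)

# With caret installed, the corresponding call would be:
# caret::confusionMatrix(data = y_hat, reference = test_set$Prognosis)
```

The key point of the sketch is that both arguments are factors with the same levels and the same length (one entry per test row), which is what the error message asks for.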