I am running randomForest on my dataset; its structure is shown below:
str(MYDATA)
'data.frame': 55377 obs. of 12 variables:
$ ï..Archive_Date: Factor w/ 12 levels "20/12/2018","26/04/2018",..: 10 10 10 10 10 10 10 10 10 10 ...
$ Hospital_Group : Factor w/ 7 levels "Children's Hospital Group",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Group.ID : int 1 1 1 1 1 1 1 1 1 1 ...
$ Hospital_HIPE : int 940 940 940 940 940 940 940 940 940 940 ...
$ Hospital_Name : Factor w/ 44 levels "Bantry General Hospital",..: 40 40 40 40 40 40 40 40 40 40 ...
$ Specialty_HIPE : int 0 0 400 400 600 600 600 600 600 600 ...
$ Specialty_Name : Factor w/ 53 levels "Anaesthetics",..: 51 51 9 9 32 32 32 32 32 32 ...
$ Case_Type : Factor w/ 2 levels "Day Case","Inpatient": 2 2 1 1 1 1 1 1 1 2 ...
$ Adult_Child : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
$ Age_Profile : Factor w/ 3 levels "0-15","16-64",..: 1 2 1 1 1 1 1 1 1 1 ...
$ Time_Bands : num 7.5 10.5 4.5 13.5 1.5 4.5 7.5 10.5 13.5 1.5 ...
$ Total : int 1 1 1 1 14 2 1 2 2 44
When I call confusionMatrix, I get the following error:
rf <- predict(forest, MyDATA_Test, type = "class")
> confusionMatrix(rf, MyDATA_Test$Time_Bands, positive = "Yes")
Error: `data` and `reference` should be factors with the same levels.
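How can I resolve this error?
Answer 0: (score: 0)
randomForest is particular about factor variables having the same levels. We can change the levels of the test dataset
to match those of the training dataset.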
# get the column names of factor columns
nm1 <- names(which(sapply(MYDATA, is.factor)))
# get the levels of subset of columns in a `list`
lst1 <- lapply(MYDATA[nm1], levels)
# use Map to assign the `levels` of 'MyData_Test' with the train column levels
MyDATA_Test[nm1] <- Map(`levels<-`, MyDATA_Test[nm1], lst1)
Note: this assumes that the test dataset contains no new levels (levels that are not present in the training dataset).
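One caveat worth checking: `levels<-` reassigns level labels by position, so if a test column has fewer levels than the training column, or has them in a different order, values can be silently relabelled. A minimal sketch of a label-based alternative using `factor(x, levels = ...)` follows, in base R only. The data frames `train` and `test`, the column `grp`, and the vectors `pred` and `ref` are hypothetical stand-ins, not objects from the question. The second half shows the same idea applied to a prediction and a reference vector, since the error message says both must be factors with the same levels, and `Time_Bands` appears as `num` in the `str()` output, so it may need converting as well.

```r
# Toy stand-ins for the train and test sets (hypothetical data, not from the question)
train <- data.frame(grp = factor(c("A", "B", "C", "A")))
test  <- data.frame(grp = factor(c("C", "A")))   # its levels are only "A" and "C"

# Positional relabelling: `levels<-` keeps the integer codes,
# so the value "C" (code 2) silently becomes "B"
relabelled <- `levels<-`(test$grp, levels(train$grp))

# Label-based re-levelling: values are matched by name, so "C" stays "C"
releveled <- factor(test$grp, levels = levels(train$grp))

# Same idea for predictions vs. reference: build both factors on one shared level set
pred <- c("low", "high", "low")   # hypothetical predicted classes
ref  <- c("low", "low", "high")   # hypothetical true classes
lv   <- union(pred, ref)          # shared set of levels
pred_f <- factor(pred, levels = lv)
ref_f  <- factor(ref,  levels = lv)

# Base-R cross-tabulation of predicted vs. true classes
cm <- table(Predicted = pred_f, Reference = ref_f)
print(cm)
```

With real data, `pred_f` and `ref_f` built this way could then be passed to confusionMatrix in place of the raw vectors. Any value not present in the supplied levels becomes `NA`, which makes unseen levels easy to spot with `anyNA()`.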