Question

I am running randomForest on my dataset; see the structure below:

str(MYDATA)
'data.frame':   55377 obs. of  12 variables:
 $ ï..Archive_Date: Factor w/ 12 levels "20/12/2018","26/04/2018",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ Hospital_Group : Factor w/ 7 levels "Children's Hospital Group",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Group.ID       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Hospital_HIPE  : int  940 940 940 940 940 940 940 940 940 940 ...
 $ Hospital_Name  : Factor w/ 44 levels "Bantry General Hospital",..: 40 40 40 40 40 40 40 40 40 40 ...
 $ Specialty_HIPE : int  0 0 400 400 600 600 600 600 600 600 ...
 $ Specialty_Name : Factor w/ 53 levels "Anaesthetics",..: 51 51 9 9 32 32 32 32 32 32 ...
 $ Case_Type      : Factor w/ 2 levels "Day Case","Inpatient": 2 2 1 1 1 1 1 1 1 2 ...
 $ Adult_Child    : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age_Profile    : Factor w/ 3 levels "0-15","16-64",..: 1 2 1 1 1 1 1 1 1 1 ...
 $ Time_Bands     : num  7.5 10.5 4.5 13.5 1.5 4.5 7.5 10.5 13.5 1.5 ...
 $ Total          : int  1 1 1 1 14 2 1 2 2 44

When I call the confusion matrix, I get the following error:

rf <- predict(forest, MyDATA_Test, type = "class")
> confusionMatrix(rf, MyDATA_Test$Time_Bands, positive = "Yes")
Error: `data` and `reference` should be factors with the same levels.

How can I fix this error?

Answer 1

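randomForest is strict about factor variables having the same levels in the training and test data. We can set the levels of the test dataset to match those of the training data.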

# get the column names of factor columns
nm1 <- names(which(sapply(MYDATA, is.factor)))
#  get the levels of subset of columns in a `list`
lst1 <- lapply(MYDATA[nm1], levels)

# use Map to assign the `levels` of 'MyDATA_Test' with the train column levels
# (R is case-sensitive, so the same object name is used on both sides)
MyDATA_Test[nm1] <- Map(`levels<-`, MyDATA_Test[nm1], lst1)
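One caveat that is easy to check on toy data: `levels<-` relabels by position rather than matching by value, so the assignment above is only safe when the test factor's existing levels already line up with the start of the training levels. The sketch below uses hypothetical `train` and `test` data frames (not the asker's data) and shows `factor(x, levels = ...)` as a value-matching alternative.

```r
# Toy data (hypothetical): the test factor lacks the level "a"
train <- data.frame(g = factor(c("a", "b", "c")))
test  <- data.frame(g = factor(c("b", "c")))

# Same two steps as in the answer: factor column names and their train levels
nm1  <- names(which(sapply(train, is.factor)))
lst1 <- lapply(train[nm1], levels)

# Positional relabeling: the first test level is renamed to the first train level
by_position <- Map(`levels<-`, test[nm1], lst1)

# Value matching: the values are kept, only the set of levels changes
by_value <- Map(function(x, lv) factor(x, levels = lv), test[nm1], lst1)

print(as.character(by_position$g))  # values were silently renamed
print(as.character(by_value$g))     # values are preserved
print(levels(by_value$g))           # full set of train levels
```

With `factor(x, levels = lv)`, any test value that is not among the training levels becomes `NA` instead of being renamed, which makes unseen levels easier to spot.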

Note: this assumes there are no new levels in the test dataset (i.e., no levels that are absent from the training data).
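Separately, the `str()` output in the question lists `Time_Bands` as `num`, not as a factor, so the reference passed to `confusionMatrix()` would likely trigger this error even when the predictor levels match. Below is a minimal sketch, assuming the forest is meant to classify time bands: both vectors are converted to factors that share one level set. The vectors `truth_num` and `pred_chr` are made-up stand-ins for `MyDATA_Test$Time_Bands` and the `predict()` result.

```r
# Made-up stand-ins for MyDATA_Test$Time_Bands and the predict() result
truth_num <- c(1.5, 4.5, 7.5, 10.5, 4.5, 1.5)
pred_chr  <- c("1.5", "4.5", "4.5", "10.5", "4.5", "1.5")

# One shared level set, taken from the reference values in sorted order
lv <- as.character(sort(unique(truth_num)))

# Convert both vectors to factors with exactly the same levels
truth <- factor(as.character(truth_num), levels = lv)
pred  <- factor(pred_chr, levels = lv)

# Check that the level sets are identical before comparing
stopifnot(identical(levels(pred), levels(truth)))

# Base-R confusion table: rows are predictions, columns are the reference
cm <- table(Prediction = pred, Reference = truth)
print(cm)
```

With caret loaded, `confusionMatrix(pred, truth)` should then accept the two factors. The argument `positive = "Yes"` only makes sense for a two-class outcome that has a level named "Yes", so it probably needs to be dropped here. If `Time_Bands` was also numeric at training time, randomForest fits a regression forest, and the response would need to be converted to a factor before training as well.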
