Question

I am trying to build a decision tree, but I get this error when I create the confusion matrix in the last line:

Error : `data` and `reference` should be factors with the same levels

Here is my code:

library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)

#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)

#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)

#making sure the data is in the right format 
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))

#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)

So I tried following the instructions from another thread:

confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))

But I still get an error:

Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid

Answer 1

Try to keep the factor levels of train and test the same as those of df.

train$`Customer type` <- factor(train$`Customer type`, unique(df$`Customer type`))
test$`Customer type` <- factor(test$`Customer type`, unique(df$`Customer type`))
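
A minimal base-R sketch of why aligning the levels matters. The label values "Member" and "Normal" and both vectors are made-up stand-ins, not data from the question. The sketch assumes the reference column is a plain character vector, as it is after the as.character() call in the question, and the caret call at the end only runs if caret is installed.

```r
# Hypothetical labels; the real data would supply its own values.
all_levels <- c("Member", "Normal")

# predict(..., type = "class") returns a factor.
predicted <- factor(c("Member", "Normal", "Member", "Member"),
                    levels = all_levels)

# After as.character(), the reference column is a character vector, not a factor.
reference <- c("Member", "Normal", "Normal", "Member")
print(is.factor(reference))

# Convert the reference to a factor with the same level set as the predictions.
reference_f <- factor(reference, levels = all_levels)
same_levels <- identical(levels(predicted), levels(reference_f))
print(same_levels)

# With matching levels, a square cross-tabulation can be built.
cm <- table(predicted = predicted, actual = reference_f)
print(cm)

# Optional: the caret call from the question, applied to the aligned factors.
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(predicted, reference_f))
}
```

Answer 1 takes the level set from unique(df$`Customer type`), so the levels appear in order of first occurrence. What matters is that both factors share the same set in the same order.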

Answer 2

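I made a toy dataset and examined your code. There were a few issues:

1. R has an easier time with variable names that follow a certain style. Your "Customer type" variable has a space in it, and coding is generally easier when you avoid spaces, so I renamed it "Customer_type". For your data.frame you could simply go into the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
2. I coded "Customer_type" as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type).
3. The documentation for sample.split() says the first argument, "Y", should be a vector of labels, but in your code you gave the variable name. The labels are the names of the levels of the factor; in my example these levels are High, Medium and Low. To see the levels of your variable you could use levels(df$Customer_type). Input these to sample.split() as a character vector.
4. Adjust the rpart() call as well: the formula should use the renamed column (Customer_type ~ .) rather than df$`Customer type`, so that the response is taken from train.

With these adjustments, your code might be okay.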
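
The adjustments above could look roughly like the sketch below. This is not the answerer's original listing: the toy data frame, its column values and the name in_train are invented for illustration. One deliberate choice is labelled in the comments: the sketch passes the whole label column to sample.split(), which is how the caTools documentation describes Y (one label per row), so the result has one TRUE/FALSE per row of df.

```r
library(rpart)
library(caTools)

set.seed(123)

# Toy data (hypothetical): the class depends on Total so the tree has signal.
n <- 200
total <- round(runif(n, 10, 1000), 2)
df <- data.frame(
  `Customer type` = ifelse(total > 650, "High",
                    ifelse(total > 350, "Medium", "Low")),
  Quantity = sample(1:10, n, replace = TRUE),
  Total = total,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Remove the space from the column name.
names(df) <- gsub("Customer type", "Customer_type", names(df))

# 2. Make the response a factor before splitting, so that train and test
#    inherit the same levels.
df$Customer_type <- factor(df$Customer_type)
print(levels(df$Customer_type))

# 3. Assumption: pass the label column itself, giving one logical per row.
#    Passing the string 'Customer type' would give a single logical value,
#    which subset() would then recycle across all rows.
in_train <- sample.split(df$Customer_type, SplitRatio = 0.70)
train <- subset(df, in_train == TRUE)
test  <- subset(df, in_train == FALSE)

# 4. Name the response column in the formula; rpart looks it up in train.
tree <- rpart(Customer_type ~ ., data = train, method = "class")

# Predictions come back as a factor with the training levels.
pred <- predict(tree, test, type = "class")
print(table(predicted = pred, actual = test$Customer_type))

# With caret installed, the call from the question should now be accepted:
# caret::confusionMatrix(pred, test$Customer_type)
```

Because the response is converted to a factor before the split, the predictions and test$Customer_type share identical levels, which is what confusionMatrix() checks for.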


"Factors with the same levels" in a confusion matrix
