Question

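Today is my first day using R, and I have run into a problem I can't find a solution for. I want to fit a decision tree on my data, and I am using these commands: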

library(tree)
options("na.action")
setwd('C:/Users/aanam/Documents/Amrita_internship')
pschool = read.csv('predict_school_new.csv', header = TRUE)

stree = pschool[,c(2,4,6,8,10)]
train = sample(1:nrow(stree),nrow(stree)/2)
test = -train

training_data = stree[train,]
testing_data = stree[test,] 
campus = stree[,1]

testing_campus = campus[test]
tree_model = tree(campus~.,data = training_data, na.action = "na.exclude")

The error I get is:

tree_model = tree(campus~.,data = training_data, na.action = "na.exclude")
# Error in model.frame.default(formula = campus ~ ., data = training_data,  : 
#   variable lengths differ (found for 'campus_id')

I checked for NA values, but there are none:

sum(is.na(stree))
# [1] 0

I also checked the lengths of the columns, and they are all the same.

length(stree[1,])
# [1] 5
length(stree[,1])
# [1] 2412147
length(stree[,2])
# [1] 2412147
length(stree[,3])
# [1] 2412147
length(stree[,4])
# [1] 2412147
length(stree[,5])
# [1] 2412147

Can anyone tell me why I am getting this error?

Answer 1

In your model you have

 tree(campus~., data = training_data, ...

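You seem to be mixing two different variable contexts there. The campus part appears to come from the campus variable you defined above, which is equal to stree[,1]. However, the . pulls all of its values from the data argument, which in this case is training_data. That is shorter than stree because you only selected a subset of the rows. The lengths you should be comparing are

length(campus) #and
nrow(training_data)

I can't tell whether your input data has proper headers. If it does, then when fitting a model with the formula (~) syntax it is best to use the column names of the data.frame in the formula. Mixing variables from inside and outside a data.frame is not a good idea. You are setting header=T, so it would be interesting to see

names(stree)

and then use those names in your formula. If the first column is actually called "campus_id", as the error message suggests, then just use

tree(campus_id ~., data = training_data, ...

and don't create a separate campus variable.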
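
To make this concrete, here is a small sketch that reproduces the same error with base R's model.frame(), the function named in the error message, so it does not need the tree package. The toy data frame, its x1 and x2 columns, and the seed are made up for illustration; only the column name campus_id is taken from the error message.

```r
# Toy stand-in for stree: 20 rows, response column named campus_id (assumed).
set.seed(1)
stree <- data.frame(
  campus_id = rep(c("a", "b"), 10),
  x1 = rnorm(20),
  x2 = rnorm(20)
)

train <- sample(1:nrow(stree), nrow(stree) / 2)
training_data <- stree[train, ]   # half of the rows
campus <- stree[, 1]              # full-length vector living outside the data frame

# 'campus' is not a column of training_data, so R takes the full-length
# vector from the workspace, while '.' expands to the shorter columns.
err <- tryCatch(
  model.frame(campus ~ ., data = training_data),
  error = function(e) conditionMessage(e)
)
print(err)

# Using the column name keeps every variable inside training_data.
mf <- model.frame(campus_id ~ ., data = training_data)
print(dim(mf))
```

The same formula change should carry over to the tree() call, since the error is raised while the model frame is being built.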

Answer 2

Compare these two:

length(campus)

nrow(training_data)

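They should match, because campus should come from the training data!

The problem is that your training data is half of the main dataset, but you take campus from the main dataset, which has twice as many rows as the training data. R cannot find a column named campus in training_data, so it falls back to the campus vector built from the main dataset. So use $ here to make it refer to your training data. I hope this helps.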
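
As a rough sketch of what that looks like, the labels can be pulled from each split with $ instead of from the full dataset. The toy data, the score column, and the names training_campus and testing_campus below are only for illustration, and the column name campus_id is assumed from the error message.

```r
# Toy stand-in for stree with an assumed campus_id column.
stree <- data.frame(campus_id = rep(c("a", "b"), 10), score = seq_len(20))

set.seed(42)
train <- sample(1:nrow(stree), nrow(stree) / 2)
training_data <- stree[train, ]
testing_data  <- stree[-train, ]

campus <- stree[, 1]                          # taken from the full dataset
training_campus <- training_data$campus_id    # taken from the training split
testing_campus  <- testing_data$campus_id     # taken from the test split

# The full-length vector is twice as long as the training data;
# the vectors taken with $ match their own split.
print(c(length(campus), nrow(training_data)))
print(c(length(training_campus), length(testing_campus)))
```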

Error in model.frame in R: variable lengths differ
