Question

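I am trying to fit three models (logistic regression, CART, and randomForest) to predict the binary factor `spam` (the dependent variable) from a list of independent variables (a document-term matrix, where rows are the documents and columns are the terms that occur in them).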

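This is for an online course. Following the instructions, I did all the typical preprocessing, such as removing punctuation and stop words, stemming, and making sure the `spam` variable was converted with `as.factor`.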

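Everything worked fine until I got to the randomForest model, where I get the error at the end of the output below. The course states explicitly that I do not need to specify other details such as ntree or nodesize; the defaults are used automatically.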

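After more research, it seems that '000' points to the part of the data causing the problem. 000 is actually the first term in my document-term matrix. Why does the error say it was not found?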

Data structure:

Email #   "000"   "account"   ... (other terms in the emails) ...   spam?
1         1       2           ... (any counts) ...                  0

Thanks for any insight on this.

> emailsSparse$spam = as.factor(emailsSparse$spam)
> set.seed(123)
> spl = sample.split(emailsSparse$spam, 0.7)
Error: could not find function "sample.split"
> library(caTools)
> spl = sample.split(emailsSparse$spam, 0.7)
> train = subset(emailsSparse, spl==TRUE)
> test = subset(emailsSparse, spl==FALSE)
> spamLog = glm(spam ~ ., data=train, family="binomial")
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
> spamCART = rpart(spam~ ., data=train, method="class")
Error: could not find function "rpart"
> library(rpart)
> library(rpart.plot)
> spamCART = rpart(spam~ ., data=train, method="class")
> set.seed(123)
> spamRF = randomForest(spam ~., data=train)
Error: could not find function "randomForest"
> library(randomForest)
randomForest 4.6-10
Type rfNews() to see new features/changes/bug fixes.
> spamRF = randomForest(spam ~., data=train)
Error in eval(expr, envir, enclos) : object '000' not found
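
A commonly reported cause of this kind of error is that document-term matrix column names such as 000 are not syntactically valid R variable names, which some formula interfaces cannot resolve. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the course data) to show how base R's `make.names` rewrites such names. Whether this resolves the randomForest error here is an assumption to check against the real `emailsSparse`.

```r
# Hypothetical toy data standing in for emailsSparse (values are made up).
toy <- data.frame(`000` = c(1, 0, 3),
                  account = c(2, 1, 0),
                  spam = factor(c(0, 1, 0)),
                  check.names = FALSE)  # keep the raw term "000" as a column name

print(colnames(toy))

# make.names() turns each name into a syntactically valid R name;
# a name that starts with a digit gets an "X" prefix.
fixed <- toy
colnames(fixed) <- make.names(colnames(fixed))

print(colnames(fixed))
```

Applied to the real data, the equivalent step would be `colnames(emailsSparse) = make.names(colnames(emailsSparse))`, run before splitting into `train` and `test` so that both subsets share the renamed columns.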
