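I am trying to load many email files so that R can learn to tell spam from ham. First I create a corpus, then I try to create a document-term matrix from it, and I get an error. How can I fix this?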
setwd("C:/ham_spam/")
library(tm)
library(stringr)
email_corpus <- Corpus(VectorSource(NA))
folders <- c("easy_ham/", "spam_2/")
for(n in 1:2){
  folder <- folders[n]
  for(i in 1:length(list.files(folder))){
    email <- list.files(folder)[i]
    tmp <- readLines(str_c(folder, email))
    tmp <- str_c(tmp, collapse = "")
    tmp_corpus <- Corpus(VectorSource(tmp))
    email_corpus <- c(email_corpus, tmp_corpus)
  }
}
dtm_email <- DocumentTermMatrix(email_corpus)
This is the error I get:
Error in UseMethod("TermDocumentMatrix", x) : no applicable method for 'TermDocumentMatrix' applied to an object of class "list"
Here is a sample of email_corpus; email_corpus is a list of data frames:
$meta
$language
[1] "en"
attr(,"class")
[1] "CorpusMeta"
$dmeta
data frame with 0 columns and 1 row
$content
[1] "From Steve_Burt@cursor-system.com Thu Aug 22 12:46:39 2002Return-Path: <Steve_Burt@cursor-system.com>Delivered-To: zzzz@localhost.netnoteinc.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id BE12E43C34\tfor... <truncated>
Answer 0 (score: 0)
Combining two corpora created with Corpus using c() converts the result to a plain list. Using c() with VCorpus, on the other hand, retains the VCorpus type. Replace all Corpus calls with VCorpus and the problem should be solved.
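As a rough illustration of the point above, here is a minimal sketch that combines two single-document corpora built with VCorpus and then builds the matrix. The two text strings are made up, and the behaviour described assumes a recent tm version in which Corpus() on a VectorSource returns a SimpleCorpus that c() cannot combine.

```r
library(tm)

# Two tiny single-document corpora built from made-up text
a <- VCorpus(VectorSource("free money click now"))
b <- VCorpus(VectorSource("meeting moved to friday"))

# c() has a method for VCorpus objects, so the result stays a corpus
combined <- c(a, b)
inherits(combined, "VCorpus")

# The combined corpus can be passed straight to DocumentTermMatrix()
dtm <- DocumentTermMatrix(combined)
dim(dtm)  # documents x terms
```

In the loop from the question, the same idea would mean starting from the first real email rather than from Corpus(VectorSource(NA)), so that no placeholder NA document ends up in the corpus.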
Answer 1 (score: 0)
You can try this approach:
Set the working directory to the folder that contains the ham and spam folders:
setwd('/path/to/dir/that/contains/folders/')
folders <- c("easy_ham/", "spam_2/")
Then you can list all files (in this case '.txt' files) in the working directory (the default path in list.files() is '.'):
emails <- list.files(pattern = "\\.txt$", # regex; assumes all emails end in .txt
                     recursive = TRUE)    # also list files in subdirectories
library(stringr)
library(tm)
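One detail about the listing step: the pattern argument of list.files() is a regular expression, not a shell glob, so an unescaped dot matches any character. The base-R sketch below shows the difference in a temporary directory; the file names are hypothetical and exist only for this demonstration.

```r
# Create a fresh temporary folder with hypothetical file names
tmp_dir <- tempfile("pattern_demo_")
dir.create(tmp_dir)
file.create(file.path(tmp_dir, c("0001.txt", "0002.txt", "notes_txt.bak")))

# "." matches any character, so "_txt" in the .bak file also matches
loose <- list.files(tmp_dir, pattern = ".txt")

# Escaped dot plus "$" keeps only names that end in ".txt"
strict <- list.files(tmp_dir, pattern = "\\.txt$")

print(loose)
print(strict)
```

If the email files have no extension at all, dropping the pattern argument and listing each folder by path may be the simpler choice.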
Then you can use lapply() to read the files:
email_txt <- lapply(emails, function(x) {
  tmp <- readLines(x)
  tmp <- str_c(tmp, collapse = "")
  return(tmp)
})
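A side note on the collapse step: joining lines with an empty string glues the last word of one line to the first word of the next, which is visible in the question's sample output ("2002Return-Path:"). The sketch below uses base paste() on two made-up header lines to show the effect; str_c() with the same collapse value joins the lines the same way.

```r
# Two made-up lines standing in for the output of readLines()
lines <- c("Subject: hello", "Return-Path: someone")

glued  <- paste(lines, collapse = "")   # no separator between lines
spaced <- paste(lines, collapse = " ")  # one space between lines

print(glued)   # "Subject: helloReturn-Path: someone"
print(spaced)  # "Subject: hello Return-Path: someone"
```

Whether this matters depends on the tokenization used later, but collapsing with a space or a newline keeps word boundaries intact.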
Create a corpus from the text you read:
email_corpus <- VCorpus(VectorSource(email_txt))
Finally, create the DocumentTermMatrix from that corpus:
dtm_email <- DocumentTermMatrix(email_corpus)
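Since the stated goal is to separate spam from ham, each row of the matrix will eventually need a label. One possible way, assuming the paths come from list.files(recursive = TRUE) and therefore start with the folder name, is to derive the label from the folder. The paths below are made up for illustration and stand in for the real emails vector.

```r
# Hypothetical relative paths, shaped like list.files(recursive = TRUE) output
emails <- c("easy_ham/0001.txt", "easy_ham/0002.txt", "spam_2/0001.txt")

# The folder part of each path decides the class
folder <- dirname(emails)
labels <- factor(ifelse(folder == "spam_2", "spam", "ham"),
                 levels = c("ham", "spam"))

table(labels)
```

Because the corpus is built from emails in the same order, labels[i] would correspond to the i-th document in dtm_email.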