TermDocumentMatrix not working on a corpus

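Asked: 2017-11-06 11:15:37

Tags: r term-document-matrix

I am trying to load a large number of email files so that R can learn to tell spam from ham. I first build a corpus, but when I try to create a document-term matrix from it I get an error. How can I fix this?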

setwd("C:/ham_spam/")

library(tm)
library(stringr)

email_corpus <- Corpus(VectorSource(NA))

folders <- c("easy_ham/", "spam_2/")

for(n in 1:2){
  folder <- folders[n]
  for(i in 1:length(list.files(folder))){
    email <- list.files(folder)[i]
    tmp <- readLines(str_c(folder, email))
    tmp <- str_c(tmp, collapse = "")
    tmp_corpus <- Corpus(VectorSource(tmp))
    email_corpus <- c(email_corpus, tmp_corpus)
  }
}

dtm_email <- DocumentTermMatrix(email_corpus)

This is the error I get:

Error in UseMethod("TermDocumentMatrix", x) : no applicable method for 'TermDocumentMatrix' applied to an object of class "list"

Below is a sample of email_corpus; email_corpus is a list of data frames.

$meta
$language
[1] "en"

attr(,"class")
[1] "CorpusMeta"

$dmeta
data frame with 0 columns and 1 row

$content
[1] "From Steve_Burt@cursor-system.com  Thu Aug 22 12:46:39 2002Return-Path: <Steve_Burt@cursor-system.com>Delivered-To: zzzz@localhost.netnoteinc.comReceived: from localhost (localhost [127.0.0.1])\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id BE12E43C34\tfor... <truncated>

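2 answers:

Answer 0 (score: 0)

Combining two corpora created with Corpus() using c() turns the result into a plain list rather than a corpus, which is why TermDocumentMatrix has no method for it.

Using c() on VCorpus objects, on the other hand, keeps the corpus type.

Replace every Corpus() call with VCorpus() and the problem should be solved.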
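A minimal sketch of that suggestion, assuming the tm package is installed. The two one-line "emails" are made-up placeholder strings, not data from the question; the point is only that c() applied to VCorpus objects returns something DocumentTermMatrix() accepts.

```r
library(tm)

# Two tiny single-document corpora (placeholder text, for illustration only)
a <- VCorpus(VectorSource("first message text"))
b <- VCorpus(VectorSource("second message text"))

# c() has a method for VCorpus, so the result is still a corpus, not a list
combined <- c(a, b)
is_vcorpus <- inherits(combined, "VCorpus")

# The combined corpus can be passed straight to DocumentTermMatrix()
dtm <- DocumentTermMatrix(combined)
n_docs <- nDocs(dtm)
```

In the question's loop this would mean changing both the initial `Corpus(VectorSource(NA))` and the per-file `Corpus(VectorSource(tmp))` to `VCorpus(...)`.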

Answer 1 (score: 0)

You can try this approach:

Set the working directory to the folder that contains the ham and spam folders:

setwd('/path/to/dir/that/contains/folders/')

folders <- c("easy_ham/", "spam_2/")

Then list all files (in this case '.txt' files) under the working directory (the default path in list.files() is '.'):

emails <- list.files(pattern = "\\.txt$", # assuming all emails are .txt files
                     recursive = TRUE)    # also list files in subdirectories

library(stringr)
library(tm)

Then use lapply() to read the files:

email_txt <- lapply(emails, function(x) {
  tmp <- readLines(x)
  tmp <- str_c(tmp, collapse = "")
  return(tmp)
})

Create a corpus from the text that was read:

email_corpus <- VCorpus(VectorSource(email_txt))

Finally, create the DocumentTermMatrix from that corpus:

dtm_email <- DocumentTermMatrix(email_corpus)
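The steps above can be tried end to end on throwaway data. The sketch below is only an illustration: the temporary directory, the file names a.txt and b.txt, and their contents are all hypothetical. It also makes two choices that differ from the code above: lines are joined with a space rather than "" so that the last word of one line does not run into the first word of the next, and a label vector is derived from each file's parent folder, which may be useful later for spam/ham classification.

```r
library(tm)

# Hypothetical toy data: two folders holding one tiny "email" each
root <- file.path(tempdir(), "ham_spam_demo")
dir.create(file.path(root, "easy_ham"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "spam_2"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("Hello team,", "meeting at noon"),
           file.path(root, "easy_ham", "a.txt"))
writeLines(c("Win money now", "click here"),
           file.path(root, "spam_2", "b.txt"))

# List every .txt file below root, keeping full paths so readLines() finds them
emails <- list.files(root, pattern = "\\.txt$",
                     recursive = TRUE, full.names = TRUE)

# Read each file into a single string; lines are joined with a space
email_txt <- vapply(emails,
                    function(f) paste(readLines(f), collapse = " "),
                    character(1), USE.NAMES = FALSE)

# Build one VCorpus from all texts, then the document-term matrix
email_corpus <- VCorpus(VectorSource(email_txt))
dtm_email <- DocumentTermMatrix(email_corpus)

# Label each document by the folder it came from (assumption: folder = class)
labels <- basename(dirname(emails))
```

Because the corpus is built once from a character vector, no c() call on corpora is needed at all, which sidesteps the error from the question.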