lapply: Error in inherits(x, "Source") : empty directory

Time: 2018-08-25 17:41:26

Tags: r machine-learning lapply document-classification

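I am learning document classification by replicating a tutorial. We are using candidates' speeches as training material. However, when I replicate the tutorial I get the following error: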

> tdm <- lapply(candidates, generateTDM, path = pathname)
Error in inherits(x, "Source") : empty directory

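I have confirmed that the error message is not caused by the directory path. I suspect that saving the candidates' speeches in .RTF format (on a MacBook) causes the error, and that .RTF may be incompatible with the way my code specifies ANSI encoding inside generateTDM: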

s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
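Since the message says "empty directory", it may help to look at what R itself finds in each candidate's folder before suspecting the file format. As far as I can tell, DirSource raises this error when the directory listing comes back with no entries, which would happen before any file is opened or decoded. Below is a minimal base-R sketch of such a check; `check_candidate_dirs` is a hypothetical helper (not part of tm), and the demo folders "a" and "b" are made up for illustration.

```r
# Hypothetical helper (not part of tm): report what R sees in each
# candidate's subdirectory, using the same path construction as generateTDM.
check_candidate_dirs <- function(candidates, path) {
  dirs <- sprintf("%s/%s", path, candidates)
  data.frame(
    candidate = candidates,
    dir       = dirs,
    exists    = dir.exists(dirs),
    # list.files() returns character(0) for a missing or empty directory
    n_entries = vapply(dirs, function(d) length(list.files(d)),
                       integer(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Self-contained demo in a temporary directory: folder "a" holds one file,
# folder "b" exists but is empty.
demo_path <- file.path(tempdir(), "speeches-demo")
dir.create(file.path(demo_path, "a"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_path, "b"), recursive = TRUE, showWarnings = FALSE)
writeLines("sample speech text", file.path(demo_path, "a", "speech1.txt"))

report <- check_candidate_dirs(c("a", "b"), demo_path)
print(report)
```

With the question's own variables the call would be `check_candidate_dirs(candidates, pathname)`. A row with `exists` FALSE or `n_entries` 0 points at a folder that R cannot find or sees as empty, whatever the format of the files saved there.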

I also tried encoding = "ASCII", but that failed as well. How can I get rid of the error message? Here is the full code:

library(tm)
library(plyr)
library(class)

# Initialize the Environment
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)

# Set Options
options(stringsAsFactors = FALSE)

# Set Parameters
candidates <- c("hitler", "drum")
pathname <- "/Path Name/Is Correct/Help"

# Clean Text
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # Wrap base tolower() so the corpus keeps its document structure
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}

# Build Term Document Matrix
generateTDM <- function(cand, path) {
  # Each candidate's speeches live in <path>/<cand>
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)

  # Drop terms that are absent from more than 70% of the documents
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
  return(result)
}

tdm <- lapply(candidates, generateTDM, path = pathname)
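On the encoding question: encoding names in R are generally resolved through iconv, and the set of recognised names differs between platforms, so a name that works on Windows is not guaranteed to be known on macOS. A minimal sketch for checking a name on the current machine follows; `is_known_encoding` is a hypothetical helper, and the result for each name depends on the platform, so no expected output is shown.

```r
# Hypothetical helper: TRUE if `enc` appears in this platform's iconv
# encoding list (compared case-insensitively).
is_known_encoding <- function(enc) {
  toupper(enc) %in% toupper(iconvlist())
}

# Check the names used in the question plus two common alternatives.
known <- vapply(c("ANSI", "ASCII", "UTF-8", "latin1"),
                is_known_encoding, logical(1))
print(known)
```

Separately from the encoding name, an .RTF file is not plain text: it carries formatting control words alongside the speech itself, so even once the files are read, that markup would end up in the term-document matrix unless the speeches are first saved as plain text.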

0 answers:

No answers yet.