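I am studying document classification by replicating a tutorial. We are using candidates' speeches as training data. However, when I run the tutorial code I get the following error: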
> tdm <- lapply(candidates, generateTDM, path = pathname)
Error in inherits(x, "Source") : empty directory
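I have confirmed that the error message is not caused by the directory path. My guess is that saving the candidates' speeches in .RTF format (on a MacBook) triggers the error, and that .RTF may be incompatible with the way my code requests ANSI encoding inside the generateTDM function: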
s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
I also tried encoding = "ASCII", but that failed as well. How can I get rid of the error message? Here is the full code:
library(tm)
library(plyr)
library(class)
# Initialize the Environment
libs <- c("tm", "plyr", "class")
lapply(libs, require, character.only = TRUE)
# Set Options
options(stringsAsFactors = FALSE)
# Set Parameters
candidates <- c("hitler", "drum")
pathname <- "/Path Name/Is Correct/Help"
# Clean Text
cleanCorpus <- function(corpus) {
  corpus.tmp <- tm_map(corpus, removePunctuation)
  corpus.tmp <- tm_map(corpus.tmp, stripWhitespace)
  # tolower is a base R function, so wrap it to keep the result a tm corpus
  corpus.tmp <- tm_map(corpus.tmp, content_transformer(tolower))
  corpus.tmp <- tm_map(corpus.tmp, removeWords, stopwords("english"))
  return(corpus.tmp)
}
# Build Term Document Matrix
generateTDM <- function(cand, path) {
  s.dir <- sprintf("%s/%s", path, cand)
  s.cor <- Corpus(DirSource(directory = s.dir, encoding = "ANSI"))
  s.cor.cl <- cleanCorpus(s.cor)
  s.tdm <- TermDocumentMatrix(s.cor.cl)
  s.tdm <- removeSparseTerms(s.tdm, 0.7)
  result <- list(name = cand, tdm = s.tdm)
}
tdm <- lapply(candidates, generateTDM, path = pathname)
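
To separate the path question from the encoding question, here is a minimal diagnostic sketch in base R. It builds each candidate folder name the same way generateTDM does and reports which files R actually sees there, and whether each one starts with the RTF signature `{\rtf`. The helper name checkSpeechDir and the temporary demo folder are hypothetical and not part of the tutorial. My working assumption, which I have not confirmed, is that "empty directory" is raised when the directory listing for s.dir comes back with no entries.

```r
# Hypothetical helper (not from the tutorial): report what R sees in one
# candidate folder, using base R only.
checkSpeechDir <- function(cand, path) {
  # Build the folder name exactly as generateTDM does
  s.dir <- sprintf("%s/%s", path, cand)
  if (!dir.exists(s.dir)) {
    return(list(dir = s.dir, exists = FALSE,
                files = character(0), rtf = logical(0)))
  }
  entries <- dir(s.dir, full.names = TRUE)
  # Keep regular files only; subfolders are dropped
  files <- entries[!file.info(entries)$isdir]
  # Compare the first five bytes with "{\rtf" to flag RTF markup
  signature <- charToRaw("{\\rtf")
  isRtf <- vapply(files, function(f) {
    identical(readBin(f, what = "raw", n = 5L), signature)
  }, logical(1), USE.NAMES = FALSE)
  list(dir = s.dir, exists = TRUE, files = files, rtf = isRtf)
}

# Demo on a throwaway folder so the sketch runs on its own
demoRoot <- tempfile("speeches")
dir.create(file.path(demoRoot, "demo"), recursive = TRUE)
writeLines("{\\rtf1\\ansi Hello}", file.path(demoRoot, "demo", "speech1.rtf"))
str(checkSpeechDir("demo", demoRoot))
```

With my own data this would be called as lapply(candidates, checkSpeechDir, path = pathname). If files comes back empty for a candidate, the problem is in the folder name or its contents rather than the encoding. If rtf is TRUE, the files contain RTF markup rather than plain text, so re-saving them as plain .txt would be the next thing to try.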