Converting text in a tm corpus back to a regular R object

Date: 2016-08-03 21:46:03

Tags: r text-mining tm corpus

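I am new to the tm package and would really appreciate your help. I have a large number of posts from which I have removed unwanted symbols and stop words using various functions of the tm package (see below). I end up with 201 documents containing the clean strings I need; however, the result is a VCorpus object rather than a regular R object. How can I join all of these processed documents into a single text file, so that they become one long string?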

In other words, how do I convert a VCorpus object to a data frame, a list, or another R object?

library(tm)

# Strip non-ASCII characters from the raw posts
docs <- iconv(posts$message, "latin1", "ASCII", sub = "")

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeNumbers)
# Wrap base functions in content_transformer() so the documents keep their class
corpus <- tm_map(corpus, content_transformer(tolower))

# Replace special characters common in emails with spaces
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/")
corpus <- tm_map(corpus, toSpace, "@")
corpus <- tm_map(corpus, toSpace, "\\|")


library(SnowballC)

corpus <- tm_map(corpus, stemDocument)  

# Remove common English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Remove words that will be common in our given context
corpus <- tm_map(corpus, removeWords, c("department", "email", "job", "fresher", "internship"))

# Remove URLs
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

corpus <- tm_map(corpus, content_transformer(removeURL))

> corpus
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 201

1 Answer:

Answer 0 (score: 2)

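A corpus is a list of plain text documents. To extract all of the text as a character array, you can use sapply to iterate over the list and apply content to each document.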

Tested with:

library(tm)
data("crude")
x <- tm_map(crude, stemDocument, lazy = TRUE)
x <- tm_map(x, content_transformer(tolower))

# Extract the text of every document into a named character vector
xx <- sapply(x, content)
str(xx)

If you need a list instead, use lapply rather than sapply.
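
To connect this back to the original question (one long string, a text file, or a data frame), here is a minimal sketch. It assumes a small stand-in corpus built from three made-up posts; the names `posts`, `texts`, `df`, `all_text`, and `out_file` are illustrative and do not come from the question. It uses vapply rather than sapply so that the result is a character vector even if a document's content spans several lines.

```r
library(tm)

# Hypothetical stand-in for the cleaned 201-document corpus
posts <- c("first cleaned post", "second cleaned post", "third cleaned post")
corpus <- VCorpus(VectorSource(posts))

# One string per document; multi-line content is collapsed with spaces
texts <- vapply(corpus,
                function(d) paste(content(d), collapse = " "),
                character(1))

# Data frame with one row per document
df <- data.frame(doc_id = names(texts),
                 text = unname(texts),
                 stringsAsFactors = FALSE)

# All documents joined into one long string
all_text <- paste(texts, collapse = " ")

# Write the long string to a text file (a temporary file here)
out_file <- tempfile(fileext = ".txt")
writeLines(all_text, out_file)
```

Replacing `tempfile(fileext = ".txt")` with a real path would write the combined text wherever it is needed, and `paste(texts, collapse = "\n")` would keep one document per line instead.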