I am having trouble with stem completion on a corpus I built with the tm package.
Here are the most important parts of my code:
library(tm)
# Build a corpus, and specify the source to be character vectors
corpus <- Corpus(VectorSource(comments_final$textOriginal))
corpus
# Convert to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove URLs
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
# Remove anything other than English letters or space
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
# Remove stopwords
myStopwords <- c(setdiff(stopwords('english'), c("r", "big")),
                 "use", "see", "used", "via", "amp")
corpus <- tm_map(corpus, removeWords, myStopwords)
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
# Remove other languages, or more specifically any character that is not
# a letter (a-z, A-Z), a digit (0-9) or whitespace
corpus <- tm_map(corpus, content_transformer(function(s) {
  gsub(pattern = '[^a-zA-Z0-9\\s]+',
       x = s,
       replacement = " ",
       ignore.case = TRUE,
       perl = TRUE)
}))
# Keep a copy of the generated corpus for stem completion later as dictionary
corpus_copy <- corpus
# Stemming words of corpus
corpus <- tm_map(corpus, stemDocument, language="english")
Now, to complete the stems, I applied stemCompletion from the tm package.
# Completing the stemming with the generated dictionary
corpus <- tm_map(corpus, content_transformer(stemCompletion), dictionary = corpus_copy, type="prevalent")
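However, this is where my corpus gets corrupted and stemCompletion does not work properly. Notably, R does not report any error: the code runs, but the result is bad.
Does anyone know a solution for this? By the way, my "comments_final" data frame consists of YouTube comments that I downloaded with the tubeR package.
Thank you very much in advance for your help. I really need this for my master's thesis.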
Answer 0 (score: 1)
The behavior does seem a bit odd, so I wrote my own stemCompletion function and applied it to the corpus. Try this in your case:
stemCompletion2 <- function(x, dictionary) {
  # Split the document into individual words
  x <- unlist(strsplit(as.character(x), " "))
  # Oddly, stemCompletion completes an empty string to
  # a word in the dictionary. Remove empty strings to avoid this.
  x <- x[x != ""]
  # Complete each stem separately against the dictionary
  x <- stemCompletion(x, dictionary = dictionary)
  # Reassemble the completed words into a single string
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
# Apply to every document, then rebuild a corpus from the resulting list
corpus <- lapply(corpus, stemCompletion2, dictionary = corpus_copy)
corpus <- as.VCorpus(corpus)
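To illustrate why the splitting step matters: my understanding is that stemCompletion treats each element of x as one stem and looks for dictionary words that begin with it. When it is wrapped in content_transformer, it receives the whole document as a single string, so there is usually nothing to complete. The sketch below imitates that prefix matching in base R only. It is a simplified stand-in, not tm's actual implementation, and complete_stem and the three-word dictionary are hypothetical names made up for this example.

```r
# Simplified stand-in for prefix-based stem completion.
# NOT tm's implementation; complete_stem is a hypothetical helper.
complete_stem <- function(stem, dictionary) {
  # Keep the dictionary words that start with the given stem
  hits <- dictionary[startsWith(dictionary, stem)]
  # Return the first match, or NA when nothing starts with the stem
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Toy dictionary (assumed for illustration only)
dictionary <- c("companies", "announced", "profits")

# Whole document passed as one "stem": no dictionary word starts with it
whole_doc <- complete_stem("compani announc profit", dictionary)

# Word by word: each stem is a prefix of some dictionary word
per_word <- vapply(c("compani", "announc", "profit"),
                   complete_stem, character(1), dictionary = dictionary)

# The empty string is a prefix of every word, so it "completes" to a word
empty <- complete_stem("", dictionary)

print(whole_doc)
print(per_word)
print(empty)
```

In this toy version the whole-document call finds no match, the word-by-word calls each find one, and the empty string matches the first dictionary word, which mirrors why stemCompletion2 splits on spaces and drops empty strings before completing.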
Hope this helps!