I apply removeWords to filter a corpus like this:
corpus <- Corpus(vs, readerControl = list(language="en"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english")))
corpus <- tm_map(corpus, removeWords, bannedWords$V1)
However, this only removes exact whole-word matches, so:
How can I remove words that contain one of my stopwords?
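Answer 0 (score: 2)
You can use stemming to reduce the words in the text to their base form, so that inflected variants match the banned word. See the following example.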
library(tm)
banned <- c("buck")
text <- c("He is bucking the trend", "A buck is not worth a dollar anymore!")
corpus <- Corpus(VectorSource(text), readerControl = list(language="en"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), banned))
writeLines(as.character(corpus[[1]]))
trend
If you do not stem the document, you get:
corpus <- Corpus(VectorSource(text), readerControl = list(language="en"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), banned))
writeLines(as.character(corpus[[1]]))
bucking trend
Answer 1 (score: 0)
I found an answer by looking at the removeWords function in the tm library source code and extending its regular expression from:
gsub(sprintf("(*UCP)\\b(%s)\\b",
to
gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
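
To make the difference between the two patterns concrete, here is a minimal base R sketch that applies both to a single string. The sample text and the variable names (banned, text, exact_pattern, containing_pattern) are made up for illustration and are not part of the original post; only gsub, sprintf, paste and sort from base R are used.

```r
# Hypothetical sample data, not taken from the original question.
banned <- c("buck")
text <- "he is bucking the trend but a buck is a buck"

# Join the banned words into one alternation, as the snippets above do.
alternation <- paste(sort(banned, decreasing = TRUE), collapse = "|")

# Original pattern: a banned word is removed only when it is a whole word.
exact_pattern <- sprintf("(*UCP)\\b(%s)\\b", alternation)

# Extended pattern: ASCII letters before and after the banned word are
# allowed, so any word containing the banned word is removed entirely.
containing_pattern <- sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b", alternation)

exact_result <- gsub(exact_pattern, "", text, perl = TRUE)
containing_result <- gsub(containing_pattern, "", text, perl = TRUE)

print(exact_result)       # "bucking" is still present
print(containing_result)  # no word containing "buck" remains
```

Two things to keep in mind with the extended pattern: [a-zA-Z] covers ASCII letters only, so a word with accented letters around the banned word would only be partly removed (a Unicode letter class such as \\p{L} may be a better fit in that case), and a short banned word will also remove every longer word that happens to contain it.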
The full function:
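# Generic: dispatches on the class of x, mirroring tm's removeWords.
removeWordsContaining <-
    function(x, words)
        UseMethod("removeWordsContaining", x)
# Character method: deletes every word that contains one of the given words.
removeWordsContaining.character <-
    function(x, words)
        gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
                     paste(sort(words, decreasing = TRUE), collapse = "|")),
             "", x, perl = TRUE)
# Document method: applies the character method to the document content.
removeWordsContaining.PlainTextDocument <-
    content_transformer(removeWordsContaining.character)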
blog_corpus <- Corpus(vs, readerControl = list(language="en"))
blog_corpus <- tm_map(blog_corpus, content_transformer(tolower))
blog_corpus <- tm_map(blog_corpus, stripWhitespace)
blog_corpus <- tm_map(blog_corpus, removePunctuation)
blog_corpus <- tm_map(blog_corpus, removeNumbers)
blog_corpus <- tm_map(blog_corpus, removeWords, c(stopwords("english")))
blog_corpus <- tm_map(blog_corpus, removeWordsContaining, bannedWords$V1)