tm_map删除包含我的停用词的词?

时间:2016-11-04 22:25:50

标签: r tm

我应用removeWords来过滤这样的语料库:

corpus <- Corpus(vs, readerControl = list(language="en")) 
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, removeWords, c(stopwords("english"))) 
corpus <- tm_map(corpus, removeWords, bannedWords$V1) 

但是,这只是匹配工作完全,所以:

  • f * ck已删除
  • f * cking未被删除

如何删除包含我的停用词的词组?

2 个答案:

答案 0 :(得分:2)

您可以使用词干将被禁止的单词带回基本表单。请参阅以下示例。

library(tm)

banned <- c("buck")
text <- c("He is bucking the trend", "A buck is not worth a dollar anymore!")

corpus <- Corpus(VectorSource(text), readerControl = list(language="en")) 
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), banned)) 

writeLines(as.character(corpus[[1]]))
  trend

如果您没有阻止该文件,您将获得:

corpus <- Corpus(VectorSource(text), readerControl = list(language="en")) 
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace) 
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), banned)) 

writeLines(as.character(corpus[[1]]))
  bucking  trend

答案 1 :(得分:0)

我通过查看tmsource code获取removeWords函数并扩展正则表达式来找到答案:

gsub(sprintf("(*UCP)\\b(%s)\\b",

gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",

完整的功能

removeWordsContaining <-
function(x, words)
    UseMethod("removeWordsContaining", x)
removeWordsContaining.character <-
function(x, words)
    gsub(sprintf("(*UCP)\\b[a-zA-Z]*(%s)[a-zA-Z]*\\b",
                 paste(sort(words, decreasing = TRUE), collapse = "|")),
         "", x, perl = TRUE)
removeWordsContaining.PlainTextDocument <-
    content_transformer(removeWordsContaining.character)

blog_corpus <- Corpus(vs, readerControl = list(language="en")) 
blog_corpus <- tm_map(blog_corpus, content_transformer(tolower))
blog_corpus <- tm_map(blog_corpus, stripWhitespace) 
blog_corpus <- tm_map(blog_corpus, removePunctuation) 
blog_corpus <- tm_map(blog_corpus, removeNumbers) 
blog_corpus <- tm_map(blog_corpus, removeWords, c(stopwords("english"))) 
blog_corpus <- tm_map(blog_corpus, removeWordsContaining, bannedWords$V1)