(R) About stop words in DocumentTermMatrix

Time: 2019-03-21 08:44:50

Tags: text-mining tm stop-words

I have a question about DocumentTermMatrix() and its stop words. I ran the code below, but did not get the result I expected.

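library(tm)       # VCorpus, VectorSource, DocumentTermMatrix
library(stringr)  # str_extract_all, boundary, and the %>% pipe

text <- "text is my text but also his text."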
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control=list(stopwords=F))
lapply(mycorpus, function(x){str_extract_all(x, boundary("word"))}) %>% unlist() %>% table()
.
also  but  his   is   my text 
   1    1    1    1    1    3 
apply(mydtm, 2, sum)
 also   but   his  text text. 
    1     1     1     2     1 

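First, even though I set stopwords=F, the DTM still dropped some stop words, such as "is". Yet "his" was not removed, although it is listed in both stopwords("en") and stopwords("SMART"). So I really don't understand which stop words the DTM uses, or why stopwords=F has no effect. What should I do to make this work properly?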

1 answer:

Answer 0: (score: 0)

You could try an alternative package: quanteda. It lets you remove stop words either after tokenization or after creating the document-feature matrix. Below, I use pad = TRUE only to show the slots where tokens matching stop words were removed.

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

text <- "text is my text but also his text."

tokens(text) %>%
  tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" ""     ""     "text" ""     "also" ""     "text" "."

Or:

dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    text is my but also his .
##   text1    3  1  1   1    1   1 1

dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    text also
##   text1    3    1

The English stop word list is just the character vector returned by stopwords() (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en"), except that the tm version also includes "will". (If you want the SMART list, it is stopwords("en", source = "smart").)
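
If you would rather stay with tm, here is a minimal sketch of what I believe is going on; treat it as my reading of ?termFreq, not something verified against your setup. With stopwords=F no stop words are removed at all, which is why "his" survives. The terms "is" and "my" most likely disappear because the wordLengths control defaults to c(3, Inf), so anything shorter than three characters is dropped. Likewise "text." is counted separately because removePunctuation defaults to FALSE. The control values below are my suggestion and are not part of the original question.

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

# Assumption: the short words vanish because of the default
# wordLengths = c(3, Inf), not because of stop word removal.
mydtm <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,         # do not remove any stop words
    wordLengths = c(1, Inf),   # keep short terms such as "is" and "my"
    removePunctuation = TRUE   # fold "text." into "text"
  )
)

# Term counts over the single document
counts <- colSums(as.matrix(mydtm))
print(counts)
```

If the counts now match the stringr table from the question, the word-length filter was the cause. To actually remove stop words in tm, pass stopwords = TRUE or a character vector of your own instead.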