Question

I have a question about DocumentTermMatrix() and its stopwords. I entered the following, but I can't get the result I want.

library(tm)
library(stringr)  # provides str_extract_all(), boundary() and %>%

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control=list(stopwords=F))
lapply(mycorpus, function(x){str_extract_all(x, boundary("word"))}) %>% unlist() %>% table()
.
also  but  his   is   my text 
   1    1    1    1    1    3 
apply(mydtm, 2, sum)
 also   but   his  text text. 
    1     1     1     2     1

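The first problem is that even though I set stopwords=F, the DTM still drops some stopwords, such as "is". Yet "his" is not removed, even though it is listed in both stopwords("en") and stopwords("SMART"). So I really don't understand which stopwords the DTM uses, or why stopwords=F doesn't work. What should I do to make it work as expected?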

Answer 1

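You can try an alternative package: quanteda. It lets you remove stopwords either after tokenization or after creating the document-feature matrix. Below, I use pad = TRUE only to show the slots where tokens matching stopwords were removed.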

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

text <- "text is my text but also his text."

tokens(text) %>%
  tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" ""     ""     "text" ""     "also" ""     "text" "."
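
To make the effect of padding concrete, here is a minimal base R sketch of the idea. It is only an imitation for illustration: remove_with_pad() is a hypothetical helper, not a quanteda function, and my_stopwords is a small hand-picked subset assumed for this example rather than the full stopwords("en") list.

```r
# Hypothetical helper (not part of quanteda): remove stopwords from a
# character vector of tokens, optionally leaving empty slots behind.
remove_with_pad <- function(tokens, stopword_list, pad = FALSE) {
  is_stop <- tolower(tokens) %in% stopword_list
  if (pad) {
    tokens[is_stop] <- ""  # keep the position, blank out the token
    tokens
  } else {
    tokens[!is_stop]       # drop the position entirely
  }
}

tokens <- c("text", "is", "my", "text", "but", "also", "his", "text", ".")

# Assumed subset for illustration only, not the full English list.
my_stopwords <- c("is", "my", "but", "his")

padded  <- remove_with_pad(tokens, my_stopwords, pad = TRUE)
dropped <- remove_with_pad(tokens, my_stopwords, pad = FALSE)

print(padded)
print(dropped)
```

With padding, the vector keeps its original length, so token positions stay aligned with the source text; without it, the remaining tokens close up.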

Or, removing the stopwords after creating the document-feature matrix:

dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    text is my but also his .
##   text1    3  1  1   1    1   1 1

dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    text also
##   text1    3    1

The English stopword list is just the character vector returned by the stopwords() function (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en"), except that the tm package's list includes "will". (If you want the SMART list, it is stopwords("en", source = "smart").)
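
As a side note for readers who want to stay with tm: the output in the question looks consistent with tm's default control settings rather than with any stopword list. The following is a sketch under that assumption. By default, termFreq() keeps only terms of at least three characters (wordLengths = c(3, Inf)), which would explain why "is" and "my" vanish while "his" survives, and it leaves punctuation alone, which would explain the separate "text." term. The variable names mydtm_all and term_counts are made up for this example.

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

# Assumption: the short words are dropped by the default minimum word
# length of 3, not by stopword removal. Lower the minimum to 1 and strip
# punctuation so that "text." is counted together with "text".
mydtm_all <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,
    removePunctuation = TRUE,
    wordLengths = c(1, Inf)
  )
)

term_counts <- colSums(as.matrix(mydtm_all))
print(term_counts)
```

With these settings, "is" and "my" should reappear and all three occurrences of "text" should be counted under one term. Passing stopwords = TRUE in the same control list would then remove stopwords explicitly.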

(R) About stopwords in DocumentTermMatrix
