I have a question about DocumentTermMatrix() and its stopwords.
I ran the following, but I don't get the result I expect.
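library(tm)       # VCorpus(), VectorSource(), DocumentTermMatrix()
library(stringr)  # str_extract_all(), boundary(), and the %>% pipe
text <- "text is my text but also his text."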
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control=list(stopwords=F))
lapply(mycorpus, function(x){str_extract_all(x, boundary("word"))}) %>% unlist() %>% table()
.
also  but  his   is   my text
   1    1    1    1    1    3
apply(mydtm, 2, sum)
 also   but   his  text text.
    1     1     1     2     1
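First, even though I used stopwords=F, the DTM still dropped some stopwords, such as "is". Yet "his" was not removed, even though it is listed in both stopwords("en") and stopwords("SMART").
So I really don't understand which stopwords the DTM uses or why stopwords=F has no effect. What should I do to make it work properly?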
Answer 0 (score: 0)
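You can try an alternative package: quanteda. It lets you remove stopwords either after tokenization or after creating the document-feature matrix. Below, I use pad = TRUE only to show the slots where tokens matching stopwords were removed.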
library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
text <- "text is my text but also his text."
tokens(text) %>%
tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" "" "" "text" "" "also" "" "text" "."
Or:
dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    text is my but also his .
##   text1    3  1  1   1    1   1 1
dfm(text, remove_punct = TRUE) %>%
dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    text also
##   text1    3    1
The English stopword list is just the character vector returned by the stopwords() function (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en"), except that the tm version includes "will". (If you want the SMART list, use stopwords("en", source = "smart").)
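As for the original tm call, a likely explanation (not stated in this thread, so treat it as an assumption and check it against ?termFreq) is that no stopword list is applied at all. stopwords = FALSE is already the default, and "is" and "my" disappear because the wordLengths control option defaults to c(3, Inf), which drops terms shorter than three characters. "his" has three characters, so it stays, and "text." is counted separately because punctuation is not removed by default. A minimal sketch under those assumptions:

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

# Assumption: the short words are dropped by the word-length filter,
# not by a stopword list, so the minimum length is lowered to 1.
# removePunctuation = TRUE is added so that "text." is counted as "text".
mydtm <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,
    wordLengths = c(1, Inf),
    removePunctuation = TRUE
  )
)

# Total count of each term across all documents
term_counts <- colSums(as.matrix(mydtm))
term_counts
```

If stopword removal is wanted after all, the same control list should accept stopwords = TRUE (the package's default list for the corpus language) or a custom character vector such as stopwords("SMART").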