Removing words from a DTM

Time: 2019-04-24 14:45:11

Tags: r text tm quanteda

I created a DTM.

library(tm)

corpus = Corpus(VectorSource(dat$Reviews))
dtm = DocumentTermMatrix(corpus)

I then use this to remove sparse (rare) terms.

dtm = removeSparseTerms(dtm, 0.98)

After removeSparseTerms, the dtm still contains some terms that are not useful for my analysis.

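The tm package has a function for removing words (removeWords()). However, this function can only be applied to a corpus or a character vector.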

How can I remove a defined set of terms from a dtm?

Here is a small sample of the input data:

library(dplyr)

samp = dat %>%
  select(Reviews) %>%
  sample_n(20)

dput(samp)
structure(list(Reviews = c("buenisimoooooo", "excelente", "excelent", 
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone", 
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase", 
"work perfect time", "amaze buy phone smoothly update charm glte yet comparably fast several different provider sims perfectly small size definitely replacemnent simple", 
"phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", 
"perfect", "great bang buck", "actually happy little sister really first good great picture late", 
"good phone good reception home fringe area screen lovely just right size good buy", 
"", "phone verizon contract phone buyer beware", "good phone", 
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund", 
"good phone price fine", "phone star battery little soon yes"
)), row.names = c(12647L, 10088L, 14055L, 3720L, 6588L, 10626L, 
10362L, 1428L, 12580L, 5381L, 10431L, 2803L, 6644L, 12969L, 348L, 
10582L, 3215L, 13358L, 12708L, 7049L), class = "data.frame")

1 Answer:

Answer 0: (score: 1)

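You should try quanteda, which calls a DocumentTermMatrix a "dfm" (document-feature matrix) and has more options for trimming it to reduce sparsity, including the function dfm_remove() for removing specific features (terms).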

If we rename your samp object to dat, then:

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus(dat, text_field = "Reviews")
corp
## Corpus consisting of 20 documents and 0 docvars.
tail(texts(corp), 2)
##                                12708                                 7049 
##              "good phone price fine" "phone star battery little soon yes"

dtm <- dfm(corp)
dtm
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).
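
A note on versions: the output above is from quanteda 1.4.3. As far as I know, more recent releases (3.x and later) expect you to tokenize before building the dfm, and texts() has been replaced by as.character(). A minimal sketch of the same construction under that assumption, using a made-up three-row dat rather than your real data:

```r
library(quanteda)

# Toy stand-in for dat (hypothetical data, not the sample from the question)
dat <- data.frame(
  Reviews = c("good phone good screen",
              "phone sim card work",
              "perfect phone buy"),
  stringsAsFactors = FALSE
)

corp <- corpus(dat, text_field = "Reviews")

# Newer quanteda: inspect the texts with as.character() instead of texts()
tail(as.character(corp), 2)

# Newer quanteda: tokenize explicitly, then build the document-feature matrix
dtm <- dfm(tokens(corp))

ndoc(dtm)
featnames(dtm)
```

The rest of the steps (dfm_trim(), dfm_remove(), convert()) take the resulting dfm in the same way.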

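Now we can trim it. For this small set of documents, a sparsity setting of 0.98 has no effect, but we can trim based on a frequency threshold instead.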

# does not actually have an effect
dfm_trim(dtm, sparsity = 0.98, verbose = TRUE)
## Note: converting sparsity into min_docfreq = 1 - 0.98 = NULL .
## No features removed.
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).

# trimming based on rare terms
dtm <- dfm_trim(dtm, min_termfreq = 3, verbose = TRUE)
## Removing features occurring:
##   - fewer than 3 times: 119
##   Total features removed: 119 (93.0%).
head(dtm)
## Document-feature matrix of: 6 documents, 9 features (83.3% sparse).
## 6 x 9 sparse Matrix of class "dfm"
##        features
## docs    phone screen sim card work good perfect buy never
##   12647     0      0   0    0    0    0       0   0     0
##   10088     0      0   0    0    0    0       0   0     0
##   14055     0      0   0    0    0    0       0   0     0
##   3720      1      0   0    0    0    0       0   0     0
##   6588      1      1   1    1    1    1       0   0     0
##   10626     0      0   0    0    1    0       1   0     0
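
If the goal is to mimic removeSparseTerms(dtm, 0.98) on the full data, a proportional document-frequency threshold is probably the closest analogue. A minimal sketch, assuming a quanteda version where dfm_trim() accepts docfreq_type = "prop"; the toy texts and the 0.5 threshold are made up for illustration (on real data, 0.02 would correspond roughly to a sparsity of 0.98):

```r
library(quanteda)

# Toy documents (hypothetical): "good" and "phone" each occur in 3 of 4 documents
txt <- c(d1 = "good phone",
         d2 = "good screen",
         d3 = "good phone",
         d4 = "rare phone")

dtm_toy <- dfm(tokens(txt))

# Keep only features that appear in at least 50% of the documents
trimmed <- dfm_trim(dtm_toy, min_docfreq = 0.5, docfreq_type = "prop")

featnames(trimmed)
```

Here "screen" and "rare" are dropped because each occurs in only one of the four documents.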

Anyway, to answer your question directly: you want dfm_remove() to get rid of specific features.

# removing from a specific list of terms
dtm <- dfm_remove(dtm, c("screen", "buy", "sim", "card"), verbose = TRUE)
## removed 4 features
## 

dtm
## Document-feature matrix of: 20 documents, 5 features (75.0% sparse).

head(dtm)
## Document-feature matrix of: 6 documents, 5 features (80.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
##        features
## docs    phone work good perfect never
##   12647     0    0    0       0     0
##   10088     0    0    0       0     0
##   14055     0    0    0       0     0
##   3720      1    0    0       0     0
##   6588      1    1    1       0     0
##   10626     0    1    0       1     0

Finally, if you still want to, you can convert the dfm to tm format using quanteda's convert() function:

convert(dtm, to = "tm")
## <<DocumentTermMatrix (documents: 20, terms: 5)>>
## Non-/sparse entries: 25/75
## Sparsity           : 75%
## Maximal term length: 7
## Weighting          : term frequency (tf)
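
If you would rather stay in tm throughout, one possible approach is to subset the columns of the DocumentTermMatrix directly, since it supports matrix-style indexing. A minimal sketch, assuming toy reviews in place of dat$Reviews and an example list of terms to drop (both hypothetical):

```r
library(tm)

# Toy reviews standing in for dat$Reviews (hypothetical data)
reviews <- c("good phone good screen",
             "phone sim card work",
             "perfect phone buy")

corpus <- Corpus(VectorSource(reviews))
dtm_tm <- DocumentTermMatrix(corpus)

# Terms to drop (example list, adjust to your own analysis)
drop_terms <- c("screen", "buy", "sim", "card")

# Keep only the columns whose term is not in drop_terms
dtm_small <- dtm_tm[, !(Terms(dtm_tm) %in% drop_terms)]

Terms(dtm_small)
```

This leaves the documents untouched and only removes the selected term columns.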