Keep only certain bigrams in a document-term matrix in R

Time: 2017-02-09 00:52:04

Tags: r

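Question: How can I keep, in a document-term matrix, only the bigrams containing "no wonderful", or only the bigrams from a list of terms I want to keep?

I want to apply this to a very large document-term matrix. I tried converting the term matrix to a regular matrix, but the resulting vector size exceeds 1000 Gb.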

Code:

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE)

library(tm)
library(RWeka)

myReader <- readTabular(mapping = list(content = "text", id = "id"))

# create a VCorpus from the data frame
tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

# bigram tokenizer (min = max = 2)
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

# create the term-document matrix using Tokenizer
dtm <- TermDocumentMatrix(tm, control = list(tokenize = Tokenizer))
inspect(dtm)

Output:

                             Docs
            Terms           10 11 12 13
            at the          0  0  0  1
            but there       0  0  1  0
            cases such      0  1  0  0
            even at         0  0  0  1
            in many         0  1  0  0
            many cases      0  1  0  0
            no wonderful    1  0  0  0
            not even        0  0  0  1
            other and       0  0  1  0
            so that         0  1  0  0
            still other     0  0  1  0
            such a          0  1  0  0
            that ever       1  0  0  0
            that in         0  1  0  0
            the rationale   0  0  0  1
            then that       1  0  0  0
            there were      0  0  1  0
            were still      0  0  1  0
            wonderful then  1  0  0  0

1 answer:

Answer 0 (score: 0)

I thought it would be more complicated because it is a DTM.

Problem solved:

    d_sel <- dtm[c('no wonderful', 'there were'),]
    inspect(d_sel)

                Docs
                Terms          10 11 12 13
                no wonderful    1  0  0  0
                there were      0  0  1  0
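
The same row subsetting can probably be extended to a longer list of bigrams, some of which may not occur in the matrix at all. The sketch below is only an illustration under a few assumptions: the toy matrix `m` is a stand-in built from a few rows of the output above, and `wanted`, `keep`, `hits`, `tdm_sel` and `tdm_wonderful` are hypothetical names that do not appear in the original post. Indexing the term-document matrix by term names keeps the result sparse, so no conversion with `as.matrix()` should be needed.

```r
library(tm)

# Toy term-document matrix standing in for the large one in the question
# (rows = bigrams, columns = document ids), copied from a few rows above.
m <- matrix(
  c(1, 0, 0, 0,
    0, 0, 1, 0,
    0, 1, 0, 0,
    1, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("no wonderful", "there were", "so that", "wonderful then"),
                  c("10", "11", "12", "13"))
)
tdm <- as.TermDocumentMatrix(m, weighting = weightTf)

# Hypothetical list of bigrams to keep; "not present" is deliberately absent.
wanted <- c("no wonderful", "there were", "not present")

# Keep only the wanted terms that actually occur in the matrix,
# so that indexing by name does not fail on a missing term.
keep <- intersect(wanted, Terms(tdm))
tdm_sel <- tdm[keep, ]

# Pattern-based variant: every bigram containing the string "wonderful".
hits <- grep("wonderful", Terms(tdm), value = TRUE, fixed = TRUE)
tdm_wonderful <- tdm[hits, ]

inspect(tdm_sel)
inspect(tdm_wonderful)
```

Using `intersect()` first is a design choice for long term lists: asking for a term that is missing from the matrix would otherwise stop with a subscript error.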