Converting a variable in a data frame to a term-document matrix

Time: 2019-02-26 13:43:33

Tags: r text lexical-analysis topic-modeling

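I have a data frame containing passages on which I want to run Latent Dirichlet Allocation. To do that, I need to create a term-document matrix. This sample code raises an error: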

library(qdap)
library(topicmodels)

remove(list=ls())
doc <- c(1,2,3,4)
text <- c("The Quick Brown Fox Jumped Over The Lazy Dog",
        "The Cow Jumped Over The Moon",
        "Moo, Moo, Brown Cow Have You Any Milk",
        "The Fox Went Out One Moonshiny Night")
works.df <- data.frame(doc,text)

works.tdm <- as.tdm(text.var = works.df$text,  grouping.var = works.df$doc)
works.lda <- LDA(works.tdm, k = 2, control = list(seed = 1234))

The error occurs at this line:

works.tdm <- as.tdm(text.var = works.df$text, grouping.var = works.df$doc)
Error in .TermDocumentMatrix(x, weighting) :
  argument "weighting" is missing, with no default

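I expected to get a sparse matrix in which, for example, the word "the" appears in documents 1 (frequency 2), 2 (frequency 2), and 4 (frequency 1); the word "cow" appears in documents 2 and 3 (frequency 1 each); ...

Can anyone suggest what I am missing, or whether there is a better way to accomplish my task? Thanks.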
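
For reference, the counts described in the question can be reproduced in base R without any text-mining package. This is only a rough sketch for checking expectations, not what qdap or tm do internally: the tokenization (lower-casing, stripping punctuation, splitting on whitespace) and the names `tokens`, `long`, and `expected.tdm` are assumptions made for illustration.

```r
text <- c("The Quick Brown Fox Jumped Over The Lazy Dog",
          "The Cow Jumped Over The Moon",
          "Moo, Moo, Brown Cow Have You Any Milk",
          "The Fox Went Out One Moonshiny Night")

# Assumed tokenization: lower-case, drop punctuation, split on whitespace
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")

# One row per (term, document) occurrence
long <- data.frame(
  term = unlist(tokens),
  doc  = rep(seq_along(tokens), lengths(tokens))
)

# Cross-tabulate: rows are terms, columns are documents
expected.tdm <- table(term = long$term, doc = long$doc)

expected.tdm["the", ]
expected.tdm["cow", ]
```

The result is a dense table rather than a sparse matrix, so it is useful for small sanity checks like this one but not as input to LDA().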

2 answers:

Answer 0: (score: 0)

You need to supply the weighting argument, as R asks:

library(tm)
works.tdm <- as.tdm(text.var = works.df$text,  grouping.var = works.df$doc, weighting = weightTf)

Answer 1: (score: 0)

It looks like I need to convert the text to a corpus first and then use the more common DocumentTermMatrix():

> library(tm)
> remove(list=ls())
> doc<-c(1,2,3,4)
> text<-c("The Quick Brown Fox Jumped Over The Lazy Dog",
+         "The Cow Jumped Over The Moon",
+         "Moo, Moo, Brown Cow Have You Any Milk",
+         "The Fox Went Out One Moonshiny Night")
> works.df<-data.frame(doc,text)
> corp <- VCorpus(VectorSource(works.df$text))
> works.tdm <- DocumentTermMatrix(corp, control=list(weighting=weightTf))
> works.tdm
<<DocumentTermMatrix (documents: 4, terms: 20)>>
Non-/sparse entries: 27/53
Sparsity           : 66%
Maximal term length: 9
Weighting          : term frequency (tf)
> as.matrix(works.tdm)
    Terms
Docs any brown cow dog fox have jumped lazy milk moo, moon moonshiny night one out over quick the went
   1   0     1   0   1   1    0      1    1    0    0    0         0     0   0   0    1     1   2    0
   2   0     0   1   0   0    0      1    0    0    0    1         0     0   0   0    1     0   2    0
   3   1     1   1   0   0    1      0    0    1    2    0         0     0   0   0    0     0   0    0
   4   0     0   0   0   1    0      0    0    0    0    0         1     1   1   1    0     0   1    1
    Terms
Docs you
   1   0
   2   0
   3   1
   4   0
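
To finish the task from the question, the document-term matrix can then be passed to LDA() from topicmodels. The following is a minimal sketch that reuses `k = 2` and `seed = 1234` from the question; the object names `dtm` and `lda` are chosen here for illustration, and with only four short documents the resulting topics are not meaningful.

```r
library(tm)
library(topicmodels)

text <- c("The Quick Brown Fox Jumped Over The Lazy Dog",
          "The Cow Jumped Over The Moon",
          "Moo, Moo, Brown Cow Have You Any Milk",
          "The Fox Went Out One Moonshiny Night")

# Build a corpus, then a document-term matrix with raw term frequencies
corp <- VCorpus(VectorSource(text))
dtm  <- DocumentTermMatrix(corp, control = list(weighting = weightTf))

# Fit a 2-topic model; the seed makes the fit reproducible
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

terms(lda, 5)   # top 5 terms per topic
topics(lda)     # most likely topic for each document
```

LDA() expects a document-term matrix with term-frequency weighting, which is why `weightTf` is used here rather than a tf-idf weighting.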