Question

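I appreciate Ben's answer here: LDA with topicmodels, how can I see which topics different documents belong to?

My question is: how can I retain the document titles in the last step? For example:

Manually create three .txt documents as separate text files and store them in the directory ~/Desktop/nature_corpus

First file title: nature.txt

First file contents: noun the natural world, Mother Nature, Mother Earth, the environment; wildlife, flora and fauna, countryside; the universe, the cosmos.

Second file title: conservation.txt

Second file contents: noun the conservation of tropical forests: preservation, protection, safeguarding, safekeeping; care, guardianship, husbandry, supervision; upkeep, maintenance, repair, restoration; ecology, environmentalism.

Third file title: bird.txt

Third file contents: noun feeding the birds: fowl; chick, fledgling, nestling; informal feathered friend, birdie; budgie; (birds) technical avifauna.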

#install.packages("tm")
#install.packages("topicmodels")
library(tm)
# Create DTM
# The file path is a Mac file path.
corpus_nature_1 <- Corpus(DirSource("/Users/[home folder name]/Desktop/nature_corpus"),readerControl=list(reader=readPlain,language="en US")) 
corpus_nature_2 <- tm_map(corpus_nature_1,removeNumbers)
corpus_nature_3 <- tm_map(corpus_nature_2,content_transformer(tolower))
mystopwords <- c(stopwords(),"noun", "verb")
corpus_nature_4 <- tm_map(corpus_nature_3,removeWords, mystopwords)
corpus_nature_5 <- tm_map(corpus_nature_4,removePunctuation)
corpus_nature_6 <- tm_map(corpus_nature_5,stripWhitespace)
dtm_nature_1 <- DocumentTermMatrix(corpus_nature_6)

inspect(dtm_nature_1)
<<DocumentTermMatrix (documents: 3, terms: 42)>>
  Non-/sparse entries: 42/84
Sparsity           : 67%
Maximal term length: 16
Weighting          : term frequency (tf)
Sample             :
  Terms
Docs               avifauna birdie birds budgie chick feathered feeding fledgling fowl mother
bird.txt                1      1     2      1     1         1       1         1    1      0
conservation.txt        0      0     0      0     0         0       0         0    0      0
nature.txt              0      0     0      0     0         0       0         0    0      2

Run the topic model with topicmodels:

# Run topic model 2 topics
library(topicmodels)
topicmodels_LDA_nature_2 <- LDA(dtm_nature_1,2,method="Gibbs",control=list(seed=1),model=NULL)
terms(topicmodels_LDA_nature_2,3)
     Topic 1  Topic 2   
[1,] "birds"  "avifauna"
[2,] "mother" "birdie"  
[3,] "chick"  "budgie"

How can I retain the document titles (visible in the output of inspect(dtm_nature_1))?

# Create CSV Matrix 2 topics
matrix_nature_2 <- as.data.frame(topicmodels_LDA_nature_2@gamma)
names(matrix_nature_2) <- c(1:2)
write.csv(matrix_nature_2,"matrix_nature_2.csv")

# Rows in this table are documents, columns are topics.
    1           2
1   0.46875     0.53125
2   0.52238806  0.47761194
3   0.555555556 0.444444444

Thanks.

Answer 1

I found this workaround, but a neater solution would still be appreciated. After running all the code above, run:

wordMatrix = as.data.frame( t(as.matrix(dtm_nature_1)) )
write.csv(wordMatrix,"dtm_nature_1.csv")

Then import the CSV file produced by this code (from above):

matrix_nature_2 <- as.data.frame(topicmodels_LDA_nature_2@gamma)
names(matrix_nature_2) <- c(1:2)
write.csv(matrix_nature_2,"matrix_nature_2.csv")

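into Excel, and import dtm_nature_1.csv into a second sheet of the same Excel file. Then copy the document titles (the column headers) from dtm_nature_1.csv and use Paste Special with Transpose to turn them into a column of row labels for matrix_nature_2.csv.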
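The Excel step can probably be avoided by attaching the document names in R before writing the CSV. The following is a minimal sketch, not a tested drop-in for the code above. It assumes the fitted model keeps the document names in its `documents` slot, in the same row order as `gamma`. It also recreates the three example files in a temporary directory instead of ~/Desktop/nature_corpus, and uses `VCorpus` rather than `Corpus`. The names `corpus_dir`, `doc_topics`, and `out_file` are illustrative.

```r
library(tm)
library(topicmodels)

# Recreate the three example files in a temporary directory
# (an assumed stand-in for ~/Desktop/nature_corpus).
corpus_dir <- file.path(tempdir(), "nature_corpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines(
  "noun the natural world, Mother Nature, Mother Earth, the environment; wildlife, flora and fauna, countryside; the universe, the cosmos.",
  file.path(corpus_dir, "nature.txt"))
writeLines(
  "noun the conservation of tropical forests: preservation, protection, safeguarding, safekeeping; care, guardianship, husbandry, supervision; upkeep, maintenance, repair, restoration; ecology, environmentalism.",
  file.path(corpus_dir, "conservation.txt"))
writeLines(
  "noun feeding the birds: fowl; chick, fledgling, nestling; informal feathered friend, birdie; budgie; (birds) technical avifauna.",
  file.path(corpus_dir, "bird.txt"))

# Same preprocessing steps as in the question, in condensed form.
corpus <- VCorpus(DirSource(corpus_dir))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, c(stopwords(), "noun", "verb"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)

# Fit the two-topic model.
model <- LDA(dtm, k = 2, method = "Gibbs", control = list(seed = 1))

# Document-topic probabilities, with rows labelled by the document
# names that the fitted model stores alongside gamma.
doc_topics <- as.data.frame(model@gamma)
rownames(doc_topics) <- model@documents
names(doc_topics) <- paste0("Topic_", seq_len(ncol(doc_topics)))

# write.csv keeps row names by default, so the titles reach the file.
out_file <- file.path(tempdir(), "matrix_nature_2.csv")
write.csv(doc_topics, out_file)

# Most likely topic for each document.
topics(model)
```

If the `documents` slot is not available in a given version of topicmodels, `Docs(dtm)` from tm should give the same names, since the rows of `gamma` follow the row order of the document-term matrix.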
