Mapping topics back to the original data frame in R

Asked: 2017-10-02 12:21:13

Tags: r

I have read data from Excel into R. The data has 459 rows and 3 columns.

library(openxlsx)
datamg <- read.xlsx("GC1.xlsx", sheet = 1, startRow = 1,
                    colNames = TRUE, skipEmptyRows = TRUE)
head(datamg,3)

                  Q                                   Themes1     Themes2
1 yes I believe it . Because the risk limits       Nature of risk    <NA>
2 Yes but a very low risk                                   Other    <NA>
3 worried about potential regulations         Regulatory concerns    <NA>

I created a corpus with the tm package and used the RWeka package to build a unigram tokenizer.

tdm1 <- TermDocumentMatrix(myCorpus1, control = list(tokenize = UnigramTokenizer))
inspect(tdm1)

<<TermDocumentMatrix (terms: 877, documents: 459)>>
Non-/sparse entries: 2714/399829
Sparsity           : 99%
Maximal term length: 13
Weighting          : term frequency (tf)
Sample             :
           Docs
Terms       149 15 204 206 256 258 279 358 400 74
  busi        0  0   0   0   0   1   0   0   1  0
  chang       0  0   0   1   0   0   0   0   0  0
  compani     0  0   0   0   0   0   0   0   0  0
  disrupt     1  0   0   0   0   0   1   1   0  0
  growth      0  2   0   0   0   0   0   0   0  0
  market      0  0   0   0   0   0   0   0   0  0
  new         0  0   0   0   0   1   0   0   0  0
  product     1  0   0   0   0   2   0   1   0  0
  risk        0  0   0   0   1   0   0   0   1  0
  technolog   1  0   0   0   0   0   1   0   0  0

After that I used the topicmodels package to fit 10 topics and took the top 2 terms of each topic:

#Topic Modelling
dtm <- as.DocumentTermMatrix(tdm1)
library(topicmodels)
lda <- LDA(dtm, k = 10) # fit 10 topics
term <- terms(lda, 2) # top 2 terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))



      Topic 1 Topic 2   Topic 3   Topic 4     Topic 5  Topic 6     Topic 7  Topic 8 Topic 9    Topic 10 
[1,] "busi"  "disrupt" "busi"    "risk"      "new"    "new"       "mani"   "chang" "chang"    "risk"   
[2,] "new"   "compani" "product" "technolog" "market" "technolog" "market" "price" "competit" "disrupt"

I need help linking these topics back to each row of the original dataset.

Example:

         Q                                   Themes1     Themes2       Topic Mapped
    1 yes I believe it . Because the risk limits       Nature of risk    <NA>  
    2 Yes but a very low risk                                   Other    <NA>
    3 worried about potential regulations         Regulatory concerns    <NA>

I thought I could do this with grep, but I could not get it to work. Any help would be appreciated. Thanks.

1 Answer:

Answer 0: (score: 1)

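To map the topics back to the original dataset, every document needs a unique identifier in both the corpus and the document-term matrix. Since you have no row ID (or other unique key), I create one from the row number and add it to the original dataset: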

library(dplyr)
library(tm)
library(topicmodels)
library(tidytext)

datamg$doc_id <- 1:nrow(datamg)

datamg <- datamg %>% 
  select(doc_id, Q) %>%
  rename('text' = Q)

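I keep only these two columns and name them 'doc_id' and 'text', because the tm package (the DataframeSource function) requires those names in order to attach the IDs to the corpus.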

myCorpus1 <- Corpus(DataframeSource(datamg))
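
As a rough sketch of how the identifiers carry through, the toy example below builds a corpus from a `doc_id`/`text` data frame, creates the document-term matrix, and fits an LDA model. The names `toy`, `toy_corpus`, `toy_dtm` and `toy_lda`, the four sentences, `k = 2` and the seed are all made up for illustration. Dropping empty rows is an extra step not in the original answer; it is included because `LDA()` rejects documents that have no terms left after preprocessing.

```r
library(tm)
library(topicmodels)

# Hypothetical toy data: doc_id must be the first column and text the second
toy <- data.frame(
  doc_id = as.character(1:4),
  text = c("yes but a very low risk",
           "worried about potential regulations",
           "new technology may disrupt the market",
           "business growth and new product risk"),
  stringsAsFactors = FALSE
)

# Build the corpus; the doc_id values travel with the documents
toy_corpus <- Corpus(DataframeSource(toy))

# Document-term matrix with default settings (no custom tokenizer here)
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Remove documents with no remaining terms (assumption: they may exist in real data)
toy_dtm <- toy_dtm[slam::row_sums(toy_dtm) > 0, ]

# Fit a small model; k and the seed are arbitrary choices for this sketch
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))

# Document names of the matrix, which should correspond to doc_id
Docs(toy_dtm)
```

Any document removed in the empty-row step will have no topic, so it would drop out of a later inner join.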

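With this corpus you can create the DTM and run the LDA model as before. Afterwards, build the "gamma matrix" (the per-document, per-topic probabilities):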

document_topic <- as.data.frame(tidy(lda, matrix = "gamma"))
document_topic$document <- as.integer(document_topic$document)

# keep the highest-gamma topic for each document (ties are all kept)
document_topic <- document_topic %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup()
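
If exactly one topic per document is needed, one possible alternative is `slice_max()` with `with_ties = FALSE` (available in dplyr 1.0.0 and later, where `top_n()` is marked as superseded). The sketch below runs on a made-up gamma table with the same columns that `tidy(lda, matrix = "gamma")` returns; the values and the name `one_topic_per_doc` are hypothetical.

```r
library(dplyr)

# Hypothetical gamma table: document 1 has a tie, document 2 does not
gamma_toy <- data.frame(
  document = c(1L, 1L, 2L, 2L),
  topic    = c(1L, 2L, 1L, 2L),
  gamma    = c(0.5, 0.5, 0.2, 0.8)
)

# Keep a single highest-gamma row per document, even when topics tie
one_topic_per_doc <- gamma_toy %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

one_topic_per_doc
```

Which of the tied topics survives is arbitrary, so this only suits cases where any one of the best topics will do.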

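The `document_topic` data frame now holds each row ID with its most probable topic (a row can still get several topics when they tie, for example a short sentence with few terms). You can then join it back to the original data frame: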

# join on the row ID; the Q column was renamed to 'text' earlier, so it cannot be the key
df_join <- inner_join(datamg, document_topic, by = c("doc_id" = "document"))
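
To get something like the "Topic Mapped" column from the question, the topic number can be swapped for the topic's top terms with a second join. The sketch below is only an illustration: `original` stands in for a copy of the data taken before the `select()` step (which drops the Themes columns), the topic assignments in `document_topic_toy` are invented, and the labels are copied from the `terms(lda, 2)` output shown in the question. The names `original`, `document_topic_toy`, `topic_labels`, `topic_mapped` and `mapped` are hypothetical.

```r
library(dplyr)

# Stand-in for the original data (first three rows from the question)
original <- data.frame(
  doc_id  = 1:3,
  Q       = c("yes I believe it . Because the risk limits",
              "Yes but a very low risk",
              "worried about potential regulations"),
  Themes1 = c("Nature of risk", "Other", "Regulatory concerns"),
  stringsAsFactors = FALSE
)

# Invented per-document topic assignments, for illustration only
document_topic_toy <- data.frame(document = 1:3, topic = c(4L, 10L, 8L))

# Labels taken from the question's output; in a real run they could be built as
#   term <- apply(terms(lda, 2), 2, paste, collapse = ", ")
#   topic_labels <- data.frame(topic = seq_along(term), topic_mapped = unname(term))
topic_labels <- data.frame(
  topic = 1:10,
  topic_mapped = c("busi, new", "disrupt, compani", "busi, product",
                   "risk, technolog", "new, market", "new, technolog",
                   "mani, market", "chang, price", "chang, competit",
                   "risk, disrupt"),
  stringsAsFactors = FALSE
)

# left_join keeps every original row, including rows without a topic
mapped <- original %>%
  left_join(document_topic_toy, by = c("doc_id" = "document")) %>%
  left_join(topic_labels, by = "topic")

mapped
```

A left join is used here instead of the inner join above so that rows with no assigned topic stay in the result with `NA` rather than disappearing.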