I have read data from Excel into R. The data has 459 rows and 3 columns.
library(openxlsx)
datamg <- read.xlsx("GC1.xlsx",sheet=1,startRow = 1,colNames =
TRUE,skipEmptyRows = TRUE)
head(datamg,3)
Q Themes1 Themes2
1 yes I believe it . Because the risk limits Nature of risk <NA>
2 Yes but a very low risk Other <NA>
3 worried about potential regulations Regulatory concerns <NA>
I created a corpus with the tm package, and used the RWeka package to build a unigram tokenizer.
tdm1 <- TermDocumentMatrix(myCorpus1, control = list(tokenize = UnigramTokenizer))
inspect(tdm1)
<<TermDocumentMatrix (terms: 877, documents: 459)>>
Non-/sparse entries: 2714/399829
Sparsity : 99%
Maximal term length: 13
Weighting : term frequency (tf)
Sample :
Docs
Terms 149 15 204 206 256 258 279 358 400 74
busi 0 0 0 0 0 1 0 0 1 0
chang 0 0 0 1 0 0 0 0 0 0
compani 0 0 0 0 0 0 0 0 0 0
disrupt 1 0 0 0 0 0 1 1 0 0
growth 0 2 0 0 0 0 0 0 0 0
market 0 0 0 0 0 0 0 0 0 0
new 0 0 0 0 0 1 0 0 0 0
product 1 0 0 0 0 2 0 1 0 0
risk 0 0 0 0 1 0 0 0 1 0
technolog 1 0 0 0 0 0 1 0 0 0
I then used the topicmodels package to fit 10 topics and extracted the top 2 terms of each topic.
#Topic Modelling
dtm <- as.DocumentTermMatrix(tdm1)
library(topicmodels)
lda <- LDA(dtm, k = 10) # fit 10 topics
term <- terms(lda, 2) # top 2 terms of every topic
(term <- apply(term, MARGIN = 2, paste, collapse = ", "))
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
[1,] "busi" "disrupt" "busi" "risk" "new" "new" "mani" "chang" "chang" "risk"
[2,] "new" "compani" "product" "technolog" "market" "technolog" "market" "price" "competit" "disrupt"
I need help linking these topics back to each row of the original dataset.
Example:
Q Themes1 Themes2 Topic Mapped
1 yes I believe it . Because the risk limits Nature of risk <NA>
2 Yes but a very low risk Other <NA>
3 worried about potential regulations Regulatory concerns <NA>
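I thought I could do this with grep, but I could not get it to work. Any help would be appreciated. Thanks.
Answer 0 (score: 1)
To map the topics back to the original dataset, each document in the corpus and in the document-term matrix needs a unique identifier. Since you have no row ID (or any other unique key), I create one from the row number and add it to the original dataset: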
library(dplyr)
library(tm)
library(topicmodels)
library(tidytext)
datamg$doc_id <- 1:nrow(datamg)
datamg <- datamg %>%
select(doc_id, Q) %>%
rename('text' = Q)
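I keep only these two columns and name them 'doc_id' and 'text', because the tm package (the DataframeSource function) needs those names to attach the ID to each document in the corpus. Note that this overwrites datamg and drops Themes1 and Themes2, so keep a copy of the original data frame if you need those columns later.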
myCorpus1 <- Corpus(DataframeSource(datamg))
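A quick way to check that the identifiers really travel with the documents is to build a small corpus and look at the document names of the resulting matrix. The following is a minimal sketch, not part of the original answer: the three-row data frame `docs_toy` is made up from the rows shown in the question, and `doc_id` is stored as character here on the assumption that tm treats it as a string identifier.

```r
library(tm)

# Hypothetical toy data in the two-column layout DataframeSource expects
docs_toy <- data.frame(
  doc_id = as.character(1:3),
  text   = c("yes I believe it because the risk limits",
             "Yes but a very low risk",
             "worried about potential regulations"),
  stringsAsFactors = FALSE
)

# Build the corpus; each document takes its ID from the doc_id column
corpus_toy <- VCorpus(DataframeSource(docs_toy))

# The document-term matrix keeps those IDs as its document names
dtm_toy <- DocumentTermMatrix(corpus_toy)
print(Docs(dtm_toy))
```

If the names printed here match `doc_id`, the `document` column produced later by `tidy()` can be joined against `doc_id` directly.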
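With this corpus you can create the DTM and run the LDA model as before. After that, you extract the "gamma" matrix (the per-document, per-topic probabilities):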
document_topic <- as.data.frame(tidy(lda, matrix = "gamma"))
document_topic$document <- as.integer(document_topic$document)
document_topic <- document_topic %>%
group_by(document) %>%
top_n(1, gamma) %>%
ungroup()
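This gives you a data frame with each row ID and its most probable topic (a document can get more than one topic when several topics tie, which tends to happen with very short, sparse sentences). You can then join it back to the original data frame on the row ID:
df_join <- inner_join(datamg, document_topic, by = c("doc_id" = "document"))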
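The selection and join steps can be tried out without fitting a model. The sketch below is illustrative only: `original` approximates the three rows shown in the question, `gamma_toy` is a made-up table in the shape that `tidy(lda, matrix = "gamma")` returns, and the topic numbers and probabilities are invented. It swaps in `slice_max()` with `with_ties = FALSE` so that each document gets exactly one topic, and `left_join()` so that rows without a topic (here document 3, assumed to be missing from the model output) are kept with `NA`.

```r
library(dplyr)

# Hypothetical copy of the original data with the row ID added
original <- data.frame(
  doc_id  = 1:3,
  Q       = c("yes I believe it . Because the risk limits",
              "Yes but a very low risk",
              "worried about potential regulations"),
  Themes1 = c("Nature of risk", "Other", "Regulatory concerns"),
  stringsAsFactors = FALSE
)

# Made-up gamma table: one row per document and topic
gamma_toy <- data.frame(
  document = c("1", "1", "2", "2"),
  topic    = c(4L, 10L, 4L, 10L),
  gamma    = c(0.8, 0.2, 0.3, 0.7),
  stringsAsFactors = FALSE
)

# Keep exactly one topic per document: the one with the highest gamma
top_topic <- gamma_toy %>%
  mutate(document = as.integer(document)) %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

# Attach the topic to every original row; unmatched rows get NA
mapped <- left_join(original, top_topic, by = c("doc_id" = "document"))
print(mapped)
```

Joining against a copy that still holds Themes1 and Themes2 keeps the hand-coded themes next to the model's topic, which makes it easy to compare the two.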