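LDA and topic modeling in R - topics, words and probabilities

Time: 2016-03-09 04:00:44

Tags: r lda

I am using the following code to run LDA and get the topics and the words associated with each topic.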

keythemes <- function(x, stp = NULL){
        suppressPackageStartupMessages(library(lda))
        suppressPackageStartupMessages(library(tm))
        suppressPackageStartupMessages(library(stringr))
        x <- iconv(a$CONTENT,"WINDOWS-1252","UTF-8")
        myCorpus <- Corpus(VectorSource(x))   
        myCorpus <- tm_map(myCorpus, content_transformer(tolower), mc.cores = 1)
        myCorpus <- tm_map(myCorpus, removePunctuation, mc.cores = 1)
        myCorpus <- tm_map(myCorpus, removeNumbers, mc.cores = 1)
        myStopwords <- c(stopwords("english"), stp)
        myCorpus <- tm_map(myCorpus, removeWords, myStopwords, mc.cores = 1)
        s <- tm_map(myCorpus, stemDocument, mc.cores = 1)
        s <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
        a.tdm.sp <- removeSparseTerms(s, sparse = 0.99)  
        suppressPackageStartupMessages(require(slam))
        a.tdm.sp.t <- t(a.tdm.sp) 
        term_tfidf <- tapply(a.tdm.sp.t$v/row_sums(a.tdm.sp.t)[a.tdm.sp.t$i], a.tdm.sp.t$j,mean) * log2(nDocs(a.tdm.sp.t)/col_sums(a.tdm.sp.t>0)) # calculate tf-idf values
        a.tdm.sp.t.tdif <- a.tdm.sp.t[,term_tfidf>=1.0] 
        a.tdm.sp.t.tdif <- a.tdm.sp.t[row_sums(a.tdm.sp.t) > 0, ]
        suppressPackageStartupMessages(require(topicmodels))
        best.model <- lapply(seq(2, 3, by = 1), function(d){LDA(a.tdm.sp.t.tdif, d)}) 
        best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))  
        best.model.logLik.df <- data.frame(topics=c(2:3), LL = as.numeric(as.matrix(best.model.logLik)))
        best.model.logLik.df.sort <- best.model.logLik.df[order(-best.model.logLik.df$LL), ] 
        ntop <- best.model.logLik.df.sort[1,]$topics
        set.seed(375)
        layout(matrix(c(1, 2), nrow=2), heights=c(1, 6))
        par(mar=rep(0, 4))
        plot.new()
        text(x=0.5, y=0.5, "Key themes based on the key words chosen. \n Themes are populated using Latent Dirichlet Allocation.", cex = 1.2)
        lda <- LDA(a.tdm.sp.t.tdif, ntop) # fit an LDA model with the optimal number of topics
        a <- get_terms(lda, 5) # get keywords for each topic, just for a quick look
        a <- data.frame(a)
        suppressPackageStartupMessages(library(gridExtra))
        grid.table(a)

}     

How can I get the topics together with the probability value of each word in each topic? The output I want looks like this:

Topic 1     Prob.Values   Topic 2     Prob.Values
offer       0.72          women       0.24
amazon      0.01          shoes       0.06
footwear    0.04          size        0.02
flat        0.07          million     0.22

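Right now I only get the topics and the words. I have tried exploring the gamma and beta values: lda@gamma gives the proportion of each topic in each document, and lda@beta gives a score for each word in each topic.

I am not sure whether the beta scores are actual probabilities or log-likelihood scores, because the values go beyond 100 and many of them are negative. A reproducible data sample follows: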

structure(list(article_id = c(4.43047e+11, 4.45992e+11, 4.45928e+11, 
4.45692e+11, 4.4574e+11, 4.43754e+11), CONTENT = c("http://www.koovs.com/women/dresses/brand-koovs/sortby-price-low/ Coupon: DRESS50 Validi tii: 17th November Not valid on discounted products.", 
"Jabong has a lot to offer this winter season. So are you ready to click and pick on the all new winter store where all the products you choose are under the budget price of Rs 999 with massive discount of", 
"daughters (Sophia, Sistine and Scarlet) all wore beautiful dresses. 'GMA' Hot List: Jeff Bezos, Sylvester Stallone and a Puppy Party. More. Amazon's Jeff Bezos weights in on making space history and more in today's 60-second hot list. 1:10 | 11/24/15. Share. Title. Description. Share From. Share With. Facebook...", 
"Bags,Wallets and Belts -- AT, wildcrafts & more starting 134 only only on app Main link äóñ http://dl.flipkart.com/dl/bags-wallets-belts/pr... 134 only http://www.flipkart.com/grabbit-men-black-walle...", 
"not revert to a Techcircle.in query till the time of filing this report. Rajan has been the mobile business head of Flipkart-controlled lifestyle e-tailer Myntra since June last year. An alumnus of Delhi College of Engineering and IIM Ahmedabad, Rajan is also the co-founder of Easy2commute.com, a carpooling...", 
NA)), .Names = c("article_id", "CONTENT"), row.names = c(1299L, 
1710L, 1822L, 2371L, 2456L, 2896L), class = "data.frame")

1 Answer:

Answer 0 (score: 0)

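The @beta slot holds the log of each topic's word distribution. topicmodels stores natural logarithms, so you can convert it back into a plain probability distribution by exponentiating it:

Terms.Probability <- exp(t(lda@beta))

Terms.Probability now has one row per term and one column per topic. Every value lies between 0 and 1, and each column sums to 1.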
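To get from that matrix to the side-by-side layout asked for in the question, something like the following sketch may help. It uses only base R and made-up numbers: `vocab` and `log_beta` are hypothetical stand-ins for `lda@terms` and `lda@beta` (assumed to be a topics-by-terms matrix of natural-log probabilities), and `top_terms` is an illustrative helper, not part of any package.

```r
# Toy stand-ins for lda@terms and lda@beta; the numbers are invented
# purely for illustration and do not come from the question's data.
vocab <- c("offer", "amazon", "footwear", "flat", "women", "shoes")
probs <- rbind(c(0.50, 0.05, 0.12, 0.20, 0.08, 0.05),
               c(0.05, 0.10, 0.03, 0.12, 0.40, 0.30))
log_beta <- log(probs)  # topics x terms, natural-log scale

# Back-transform with exp(); each row (one topic) then sums to 1.
term_prob <- exp(log_beta)
colnames(term_prob) <- vocab

# Hypothetical helper: for each topic, keep the n most probable terms
# together with their probabilities.
top_terms <- function(p, n = 3) {
  out <- lapply(seq_len(nrow(p)), function(k) {
    idx <- order(p[k, ], decreasing = TRUE)[seq_len(n)]
    data.frame(term = colnames(p)[idx],
               prob = round(unname(p[k, idx]), 2),
               stringsAsFactors = FALSE)
  })
  names(out) <- paste("Topic", seq_len(nrow(p)))
  out
}

result <- top_terms(term_prob, n = 3)

# Bind the per-topic tables side by side: one term/prob pair per topic.
wide <- do.call(cbind, result)
print(wide)
```

With a fitted model, `lda@beta` and `lda@terms` would take the place of `log_beta` and `vocab`. As far as I know, `posterior(lda)$terms` in topicmodels returns the same topics-by-terms probability matrix directly, so it can be used instead of exponentiating the slot by hand.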