I am working on LDA in R and trying to evaluate my model's perplexity for different numbers of topics k, in order to get a sense of what a good number of topics would be. However, I have noticed that perplexity seems to go up as k increases (which I believe it should not). I was able to reproduce the problem with the AssociatedPress dataset from {topicmodels}. Here is the code:
data("AssociatedPress")
# Hold out 25% of the documents as a validation set
splitter_AP <- sample(1:nrow(AssociatedPress), floor(nrow(AssociatedPress) * 0.25))
train_set_AP <- AssociatedPress[-splitter_AP, ]
valid_set_AP <- AssociatedPress[splitter_AP, ]
# Set parameters for Gibbs sampling
burnin <- 1000
iter <- 2000
seed <-list(2003,5,63,100001,765)
nstart <- 5
best <- TRUE
verbose <- 100
# Run LDA (I repeated the next step using k = 10, 20 and 30 in this example; see the loop sketch below)
ldaOut_AP10 <- LDA(train_set_AP, k = 10, method = "Gibbs",
                   control = list(nstart = nstart,
                                  seed = seed,
                                  best = best,
                                  burnin = burnin,
                                  iter = iter,
                                  verbose = verbose))
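For completeness, the k = 20 and k = 30 models used below (ldaOut_AP20, ldaOut_AP30) were fitted in exactly the same way, only with a different k. A loop along these lines reproduces all three fits (just a sketch of that repeated step; nothing else changes):

# Repeat the same call for each k and store the fits as ldaOut_AP10/20/30
for (k in c(10, 20, 30)) {
  fit <- LDA(train_set_AP, k = k, method = "Gibbs",
             control = list(nstart = nstart, seed = seed, best = best,
                            burnin = burnin, iter = iter, verbose = verbose))
  assign(paste0("ldaOut_AP", k), fit)
}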
perplexity(ldaOut_AP10, newdata = valid_set_AP, estimate_theta = FALSE) # returned 5544.164
perplexity(ldaOut_AP20, newdata = valid_set_AP, estimate_theta = FALSE) # returned 5755.367
perplexity(ldaOut_AP30, newdata = valid_set_AP, estimate_theta = FALSE) # returned 5808.529
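If I understand the metric correctly, perplexity is exp(-(held-out log-likelihood) / (number of held-out tokens)), so the numbers above correspond to a per-token log-likelihood that gets slightly worse as k grows:

# Per-token held-out log-likelihood implied by the perplexities above
# (log-likelihood per token = -log(perplexity); less negative is better)
-log(c(k10 = 5544.164, k20 = 5755.367, k30 = 5808.529))
# roughly -8.62, -8.66 and -8.67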
This post does a very good job of showing that perplexity should decrease, not increase, as the number of topics grows. I just can't see where I'm going wrong. Any help is greatly appreciated!