I am using topic modelling to cluster documents, and I need to come up with the best number of topics. So I decided to do 10-fold cross-validation with 10, 20, ..., 60 topics.
I have divided my corpus into ten batches and set one batch aside as a hold-out set. I have run latent Dirichlet allocation (LDA) with 10 to 60 topics on the other nine batches (180 documents in total). Now I have to calculate the perplexity or log likelihood on the hold-out set.
I found this code in a discussion on CV, but I don't really understand the few lines of code below. I have the document-term matrix (dtm) for the hold-out set (20 documents), but I don't know how to calculate the perplexity or log likelihood on this hold-out set.
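For concreteness, a minimal sketch of the kind of split I mean (dtm_all is a placeholder name for a DocumentTermMatrix holding all 200 documents; none of these names come from the code below):
library(topicmodels)
set.seed(1)
fold <- sample(rep(1:10, length.out = 200))  # assign each document to one of ten batches
train_dtm <- dtm_all[fold != 10, ]           # 180 documents used for fitting
dtm       <- dtm_all[fold == 10, ]           # 20 held-out documents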
Questions:
Can somebody explain to me what seq(2, 100, by = 1) means? Also, what does AssociatedPress[21:30] mean, and what does function(k) do here?
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
If I want to calculate the perplexity or log likelihood of my hold-out set named dtm, is there better code? I know there are the perplexity() and logLik() functions, but since I am new to this I cannot figure out how to apply them to my hold-out matrix (called dtm).
How can I do 10-fold cross-validation with my corpus of 200 documents? Is there existing code that I can call? I found caret for this purpose, but again I cannot figure that out either.
Answer 0 (score: 22)
I wrote the answer on CV that you refer to; here's some more detail:
seq(2, 100, by = 1) simply creates a sequence of numbers from 2 to 100 by ones, so 2, 3, 4, 5, ... 100. Those are the numbers of topics I want to use in the models: one model with 2 topics, another with 3 topics, another with 4 topics, and so on up to 100 topics.
AssociatedPress[21:30] is simply a subset of the data built into the topicmodels package. I just used a subset in that example so that it would run faster.
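As for function(k): it is an anonymous function that lapply calls once for each element of the sequence, so best.model ends up as a list of fitted LDA models, one per number of topics. A stripped-down sketch of the same pattern on toy values (my illustration, not from the original code):
squares <- lapply(seq(2, 5, by = 1), function(k) k^2)
unlist(squares)  # 4 9 16 25 -- one result per value of k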
On the general question of the optimal number of topics, I now follow Martin's example of model selection by harmonic mean (section 4.3.3 of his thesis, here: http://epub.wu.ac.at/3558/1/main.pdf). This is how I currently do it:
library(topicmodels)
#
# get some of the example data that's bundled with the package
#
data("AssociatedPress", package = "topicmodels")
harmonicMean <- function(logLikelihoods, precision = 2000L) {
  library("Rmpfr")  # extended-precision arithmetic; exp() of log-likelihoods this large overflows doubles
  llMed <- median(logLikelihoods)
  # log of the harmonic mean, computed stably by shifting by the median before exponentiating
  as.double(llMed - log(mean(exp(-mpfr(logLikelihoods, prec = precision) + llMed))))
}
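# To see what harmonicMean() returns, a quick sanity check on toy values where the
# answer can be worked out by hand (my own illustration, not part of the original code):
x <- log(c(1, 2, 4))  # pretend log-likelihoods of 1, 2 and 4
exp(harmonicMean(x))  # harmonic mean of 1, 2 and 4: 3 / (1 + 1/2 + 1/4) = 12/7, about 1.714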
# The log-likelihood values are then determined by first fitting the model using for example
k = 20
burnin = 1000
iter = 1000
keep = 50
fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",
              control = list(burnin = burnin, iter = iter, keep = keep))
# keep indicates that the log-likelihood is evaluated and stored every keep iterations.
# This returns all log-likelihood values, including those from the burn-in, which need
# to be omitted before calculating the harmonic mean:
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
# assuming that burnin is a multiple of keep and
harmonicMean(logLiks)
So, to do this over a sequence of topic models with different numbers of topics...
# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 2 to 50, by ones
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",
                                            control = list(burnin = burnin, iter = iter, keep = keep)))
# extract the log-likelihoods from each topic model
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l")
# compute optimum number of topics
sequ[which.max(hm_many)]
## 6
Here is the output of the plot, with the number of topics along the x-axis, indicating that 6 topics is optimal.
Cross-validation of topic models is documented in detail in the vignette that comes with the package, here: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf Give that a try, then come back with more specific questions about writing cross-validation with topic models.
Answer 1 (score: 16)
The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset or how to use cross-validation.
Perplexity measures how well a probability model fits a new set of data. In the topicmodels R package this is easy to compute with the perplexity function, which takes as arguments a previously fitted topic model and a new set of data, and returns a single number. The lower, the better.
For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):
# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)
data("AssociatedPress", package = "topicmodels")
burnin = 1000
iter = 1000
keep = 50
full_data <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5
splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300
The perplexity is higher for the validation set than for the training set, because the topics have been optimised based on the training set.
Extending this idea to cross-validation is straightforward. Divide the data into different subsets (say five); each subset gets one turn as the validation set and four turns as part of the training set. However, it really is computationally intensive, particularly when trying out larger numbers of topics.
You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the kind of thing I prefer to do myself to make sure I understand what's going on.
The code below, even with parallel processing on seven logical CPUs, took 3.5 hours to run on my laptop:
#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)
# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
library(topicmodels)
})
folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics
# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))
# we parallelize by the different number of topics. A processor is allocated a value
# of k, and does the cross-validation serially. This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise
system.time({
results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar% {
k <- candidate_k[j]
results_1k <- matrix(0, nrow = folds, ncol = 2)
colnames(results_1k) <- c("k", "perplexity")
for(i in 1:folds){
train_set <- full_data[splitfolds != i , ]
valid_set <- full_data[splitfolds == i, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
results_1k[i,] <- c(k, perplexity(fitted, newdata = valid_set))
}
return(results_1k)
}
})
stopCluster(cluster)
results_df <- as.data.frame(results)
ggplot(results_df, aes(x = k, y = perplexity)) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
"(ie five different models fit for each candidate number of topics)") +
labs(x = "Candidate number of topics", y = "Perplexity when fitting the trained model to the hold-out set")
We see in the results that 200 topics is too many and shows some over-fitting, while 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity across the five different hold-out sets.
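If you want the best k as a number rather than reading it off the plot, one option (a small follow-up sketch of my own, reusing the results_df built above) is to average the perplexity over the five folds and take the minimum:
mean_perplexity <- aggregate(perplexity ~ k, data = results_df, FUN = mean)
mean_perplexity$k[which.min(mean_perplexity$perplexity)]  # 100 for the run described above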