Question

I am fitting an LDA model in Spark mllib using the OnlineLDAOptimizer. Fitting 10 topics on 9M documents (tweets) takes only about 200 seconds.

import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}

val numTopics = 10
val lda = new LDA()
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(math.min(1.0, mbf)))
  .setK(numTopics)
  .setMaxIterations(2)
  .setDocConcentration(-1) // use default symmetric document-topic prior
  .setTopicConcentration(-1) // use default symmetric topic-word prior
val startTime = System.nanoTime()
val ldaModel = lda.run(countVectors)
val elapsed = (System.nanoTime() - startTime) / 1e9 // training time in seconds

/**
 * Print results
 */
// Print training time
println(s"Finished training LDA model.  Summary:")
println(s"Training time (sec)\t$elapsed")
println(s"==========")

numTopics: Int = 10
lda: org.apache.spark.mllib.clustering.LDA = org.apache.spark.mllib.clustering.LDA@72678a91
startTime: Long = 11889875112618
ldaModel: org.apache.spark.mllib.clustering.LDAModel = org.apache.spark.mllib.clustering.LocalLDAModel@351e2b4c
Finished training LDA model.  Summary:
Training time (sec) 202.640775542

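However, when I request the log perplexity of this model (it looks like I need to cast it back to LocalLDAModel first), it takes a very long time to evaluate. Why? (I am trying to get the log perplexity so that I can optimize k, the number of topics.)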

ldaModel.asInstanceOf[LocalLDAModel].logPerplexity(countVectors)
res95: Double = 7.006006572908673
Took 1212 seconds.

Answer 1

In general, calculating perplexity is not a simple matter: https://stats.stackexchange.com/questions/18167/how-to-calculate-perplexity-of-a-holdout-with-latent-dirichlet-allocation
Also, setting the number of topics by looking only at perplexity may not be the right approach: https://www.quora.com/What-are-good-ways-of-evaluating-the-topics-generated-by-running-LDA-on-a-corpus

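An LDAModel learned with the online optimizer is of type LocalLDAModel anyway, so no conversion takes place. I calculated perplexity on both the local and the distributed model: both take quite some time. Looking at the code, they have nested map calls over the whole dataset.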

The call:

docBound += count * LDAUtils.logSumExp(Elogthetad + localElogbeta(idx, ::).t)

runs (9M * non-zero BOW entries) times, which can take quite some time. The code is from line 312 of: https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
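To make the cost of that line concrete, here is a minimal plain-Scala sketch of the per-document term. It is not Spark's implementation: `DocBoundSketch`, `docBound`, and the array-based inputs are names I made up for illustration, and `logSumExp` is only a stand-in for `LDAUtils.logSumExp`. The point is that every non-zero entry of every document triggers one pass over all k topics.

```scala
object DocBoundSketch {
  /** Numerically stable log(sum(exp(x))); a stand-in for LDAUtils.logSumExp. */
  def logSumExp(xs: Array[Double]): Double = {
    val max = xs.max
    max + math.log(xs.map(x => math.exp(x - max)).sum)
  }

  /**
   * Hypothetical per-document term: one logSumExp over k topics for every
   * non-zero (termIndex, count) entry of the document.
   * elogTheta has length k; elogBeta(termIndex) is a length-k row.
   */
  def docBound(
      termCounts: Seq[(Int, Double)],
      elogTheta: Array[Double],
      elogBeta: Array[Array[Double]]): Double =
    termCounts.map { case (idx, count) =>
      // Add the document-topic and topic-word terms, topic by topic.
      val scores = elogTheta.zip(elogBeta(idx)).map { case (t, b) => t + b }
      count * logSumExp(scores)
    }.sum
}
```

Under these assumptions the work per document grows with (non-zero entries * k), and logPerplexity repeats it for all 9M documents.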

Training LDA is fast in your case because you train for only 2 iterations, with 9m / mbf update calls.
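A rough back-of-the-envelope comparison, as a sketch only: it assumes each online iteration samples roughly a miniBatchFraction share of the corpus, and the value 0.05 for mbf is hypothetical, since the question does not state it. `LdaWorkSketch` and its methods are illustrative names, not Spark API.

```scala
object LdaWorkSketch {
  /** Expected number of documents visited by online training (assumption: one sampled mini-batch per iteration). */
  def expectedTrainingDocs(corpusSize: Long, miniBatchFraction: Double, maxIterations: Int): Double =
    corpusSize * miniBatchFraction * maxIterations

  /** logPerplexity on the full corpus visits every document once. */
  def perplexityDocs(corpusSize: Long): Long = corpusSize

  def main(args: Array[String]): Unit = {
    val corpusSize = 9000000L // 9M tweets, from the question
    val mbf = 0.05            // hypothetical mini-batch fraction, not given in the question
    val trainDocs = expectedTrainingDocs(corpusSize, mbf, maxIterations = 2)
    println(s"documents visited in training (expected): $trainDocs")
    println(s"documents visited by logPerplexity: ${perplexityDocs(corpusSize)}")
  }
}
```

With a small mini-batch fraction and only 2 iterations, training touches a fraction of the corpus, while the perplexity evaluation has to go through all of it.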

By the way, the default value of docConcentration is Vectors.dense(-1), not just an Int.

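By the way 2: thanks for this question. I was having trouble running my own algorithm because I had this stupid perplexity calculation in it and did not know it would cause so much trouble.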

Why is reporting the log perplexity of an LDA model in Spark mllib so slow?