Question

I'm having some trouble understanding the LDA topic model results from Spark MLlib.

As I understand it, we should get results like the following:

 Topic 1: term1, term2, term....
 Topic 2: term1, term2, term3...
 ...
 Topic n: term1, ........

 Doc1 : Topic1, Topic2,...
 Doc2 : Topic1, Topic2,...
 Doc3 : Topic1, Topic2,...
 ...
 Docn : Topic1, Topic2,...

I applied LDA to the Spark MLlib sample data, which looks like this:

1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2 9
1 1 1 9 2 1 2 0 0 1 3
4 4 0 3 4 2 1 3 0 0 0
2 8 2 0 3 0 2 0 2 7 2
1 1 1 9 0 2 2 0 0 3 3
4 1 0 0 4 5 1 3 0 1 0

Afterwards I got a result matrix in which each column is the term distribution of a topic. There are 3 topics in total, and each topic is a distribution over 11 vocabulary terms.

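I think there are 12 documents, each with 11 vocabulary terms. My questions are:

How do I find the topic distribution of each document?
Why does each topic have a distribution over 11 vocabulary terms, when the data contains only 10 different vocabulary terms (0-9)?
Why does each column not sum to 100 (which, as I understand it, would mean 100%)?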

Answer 1

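You can get the topic distribution of each document by calling DistributedLDAModel.topicDistributions() or DistributedLDAModel.javaTopicDistributions(), both available in Spark 1.4. This only works when the model's optimizer is set to EMLDAOptimizer (the default).

There is an example of this, along with the API documentation, in the Spark MLlib docs.

In Java it looks like this: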

    // The cast is valid only for a model trained with the EM optimizer (the default).
    LDAModel ldaModel = lda.setK(k.intValue()).run(corpus);
    JavaPairRDD<Long, Vector> topic_dist_over_docs = ((DistributedLDAModel) ldaModel).javaTopicDistributions();
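Each entry of that RDD pairs a document id with a vector of k topic weights. As a minimal sketch of reading one such vector, the plain-Java helper below picks the topic with the largest weight. It has no Spark dependency: the `double[]` stands in for what `Vector.toArray()` would give you, and the class name, method name, and sample values are all made up for illustration.

```java
public class DominantTopic {
    // Return the index of the largest weight in a document's topic distribution.
    static int dominantTopic(double[] topicWeights) {
        if (topicWeights.length == 0) {
            throw new IllegalArgumentException("empty distribution");
        }
        int best = 0;
        for (int i = 1; i < topicWeights.length; i++) {
            if (topicWeights[i] > topicWeights[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical distribution over k = 3 topics for one document (made-up values).
        double[] doc = {0.2, 0.7, 0.1};
        System.out.println("dominant topic index = " + dominantTopic(doc));
    }
}
```

Topic indices here are zero-based positions in the vector, which is an assumption about how you choose to label topics rather than something the answer specifies.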

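As for the second question:

The LDA model returns, for each topic, a probability distribution over every word in the vocabulary. So you have three topics (three columns), each with 11 rows (one row per word in the vocabulary), because the vocabulary size is 11. Each row of the sample data is a word-count vector with 11 entries, one per vocabulary term; the values 0-9 are counts, not word ids.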
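To make the 12-by-11 shape concrete, here is a small plain-Java sketch that parses the sample rows from the question and reports the number of documents and the vocabulary size. It does not use Spark; the class and method names are hypothetical, and it assumes each line is one document's term-count vector.

```java
import java.util.Arrays;

public class SampleDataShape {
    // Rows copied from the sample data in the question: one document per line.
    static final String[] ROWS = {
        "1 2 6 0 2 3 1 1 0 0 3",
        "1 3 0 1 3 0 0 2 0 0 1",
        "1 4 1 0 0 4 9 0 1 2 0",
        "2 1 0 3 0 0 5 0 2 3 9",
        "3 1 1 9 3 0 2 0 0 1 3",
        "4 2 0 3 4 5 1 1 1 4 0",
        "2 1 0 3 0 0 5 0 2 2 9",
        "1 1 1 9 2 1 2 0 0 1 3",
        "4 4 0 3 4 2 1 3 0 0 0",
        "2 8 2 0 3 0 2 0 2 7 2",
        "1 1 1 9 0 2 2 0 0 3 3",
        "4 1 0 0 4 5 1 3 0 1 0"
    };

    // Parse one whitespace-separated line into a vector of term counts.
    static double[] parseRow(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                     .mapToDouble(Double::parseDouble)
                     .toArray();
    }

    public static void main(String[] args) {
        int numDocs = ROWS.length;                 // one document per row
        int vocabSize = parseRow(ROWS[0]).length;  // one entry per vocabulary term
        System.out.println("documents = " + numDocs + ", vocabulary size = " + vocabSize);
    }
}
```

The vocabulary size comes from the length of each row, not from the range of values inside it.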

Answer 2

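Why does each column not sum to 100 (which, as I understand it, would mean 100%)?

Use the describeTopics method to get each topic's distribution over the vocabulary.
The term probabilities of each topic should sum to about 1.0 (close to it, though not exactly 1.0).

Sample code in Java: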

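    // Requires: import scala.Tuple2;
    // One (term indices, term weights) pair per topic, terms sorted by descending weight.
    Tuple2<int[], double[]>[] topicDesces = ldaModel.describeTopics();
    int topicCount = topicDesces.length;

    for (int t = 0; t < topicCount; t++) {
        Tuple2<int[], double[]> topic = topicDesces[t];
        System.out.print("Topic " + t + ":");

        int[] indices = topic._1();    // vocabulary indices of the terms
        double[] values = topic._2();  // weight of each term in this topic
        double sum = 0.0d;
        int wordCount = indices.length;

        for (int w = 0; w < wordCount; w++) {
            double prob = values[w];
            System.out.format("\t%d:%f", indices[w], prob);
            sum += prob;
        }
        // Print the total weight of the topic; expected to be close to 1.0.
        System.out.println("(" + sum + ")");
    }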
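If the matrix you printed holds raw term weights rather than probabilities (an assumption; the answer does not say which), you can turn one column into a distribution by dividing by its total. A minimal plain-Java sketch, with a hypothetical class name and made-up weights:

```java
public class NormalizeTopic {
    // Divide each weight by the column total so the result sums to 1.0.
    static double[] normalize(double[] weights) {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("column sums to zero");
        }
        double[] probs = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            probs[i] = weights[i] / total;
        }
        return probs;
    }

    public static void main(String[] args) {
        // Hypothetical term weights for one topic (made up for illustration).
        double[] column = {4.0, 1.0, 3.0, 2.0};
        double sum = 0.0;
        for (double p : normalize(column)) {
            System.out.format("%f ", p);
            sum += p;
        }
        System.out.println("(sum = " + sum + ")");
    }
}
```

Because of floating-point rounding, the printed sum can differ from 1.0 in the last digits, which matches the "close, though not exactly 1.0" remark above.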
