How to convert topic indices to topic words in LDA

Date: 2016-09-02 03:22:12

Tags: scala apache-spark lda

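How can I get the vocabArray from an LDA model (org.apache.spark.ml.clustering.LDA)? All I can get is vocabSize, which returns the number of words scanned.

Ideally, I need the actual array of words from the model, and then, based on the termIndices, I want to see the words in each bucket.

I need to do this in Scala. Any suggestions would be helpful.

Here is what I have tried so far. My topics are a DataFrame: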

topicIndices: org.apache.spark.sql.DataFrame = [topic: int, termIndices: array<int>, termWeights: array<double>]

I am trying to get the topics like this:

val topics = topicIndices.map { case (terms, termWeights) =>
      terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
    }

But it throws the following error:

<console>:96: error: constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
       val topics = topicIndices.map { case (terms, termWeights) =>
                                            ^
<console>:97: error: not found: value terms
             terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
             ^

1 answer:

Answer 0 (score: 2)

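Problem solved. Here is the missing piece. Once you have the DataFrame from describeTopics, the code below maps the term indices to the corresponding words. (Note: this code is for the ml library version of LDA.)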

// describeTopics returns a DataFrame with columns: topic, termIndices, termWeights
val topicDF = model.describeTopics(maxTermsPerTopic = 10)
for (row <- topicDF) {
  val topicNumber = row.get(0)
  val topicTerms  = row.get(1)
  println("Topic: " + topicNumber)
}

import scala.collection.mutable.WrappedArray

// vectorizer is assumed to be the fitted CountVectorizerModel that produced the LDA input
val vocab = vectorizer.vocabulary

for (row <- topicDF) {
  val topicNumber = row.get(0)
  val terms: WrappedArray[Int] = row.get(1).asInstanceOf[WrappedArray[Int]]
  // Look up the first 4 term indices of each topic in the vocabulary
  for (term <- terms.take(4)) {
    println("Topic:" + topicNumber + " Word:" + vocab(term))
  }
}

topicDF.printSchema
import org.apache.spark.sql.Row

// collect() brings the rows to the driver, so the output appears in the console
topicDF.collect().foreach { r =>
  r match {
    case _: Row  => println("Topic:" + r)
    case unknown => println("Something Else")
  }
}

topicDF.collect().foreach { r =>
  println("Topic:" + r(0))
  val terms: WrappedArray[Int] = r(1).asInstanceOf[WrappedArray[Int]]
  terms.foreach { t =>
    println("Term:" + vocab(t))
  }
}
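
Since the question asked for (word, weight) pairs rather than printed words, the lookup can also be written as a small pure function. This is only a sketch: TopicWords and resolve are hypothetical names, not part of Spark, and it assumes the vocabulary is available as an Array[String] (for example from a fitted CountVectorizerModel) and that termIndices and termWeights line up one-to-one, as in the schema shown in the question.

```scala
// Hypothetical helper (not part of Spark): turns one topic's term indices
// and weights into (word, weight) pairs by looking each index up in vocab.
object TopicWords {
  def resolve(
      vocab: Array[String],
      termIndices: Seq[Int],
      termWeights: Seq[Double]
  ): Seq[(String, Double)] =
    termIndices.zip(termWeights).collect {
      // Indices outside the vocabulary are skipped instead of throwing.
      case (idx, weight) if idx >= 0 && idx < vocab.length => (vocab(idx), weight)
    }
}
```

With Spark, this could be applied to each collected row of the describeTopics DataFrame, for instance TopicWords.resolve(vocab, r.getAs[Seq[Int]]("termIndices"), r.getAs[Seq[Double]]("termWeights")). Reading the columns as Seq avoids the cast to WrappedArray, and keeping the lookup separate from the Row handling makes it easy to check without a running cluster.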