How do I get the vocabArray from an LDA model (org.apache.spark.ml.clustering.LDA)? So far I only have vocabSize, which returns the number of terms scanned.
Ideally I need the actual array of words from the model, so that from the termIndices I can look up the words in each topic bucket.
I need to do this in Scala. Any suggestions would be helpful.
What I have tried so far: my topics are a DataFrame
topicIndices: org.apache.spark.sql.DataFrame = [topic: int, termIndices: array<int>, termWeights: array<double>]
and I am trying to extract the topics like this:
val topics = topicIndices.map { case (terms, termWeights) =>
  terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
}
but it throws the following error:
<console>:96: error: constructor cannot be instantiated to expected type;
 found   : (T1, T2)
 required: org.apache.spark.sql.Row
       val topics = topicIndices.map { case (terms, termWeights) =>
                                            ^
<console>:97: error: not found: value terms
         terms.zip(termWeights).map { case (term, weight) => (vocabArray(term.toInt), weight) }
         ^
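The error occurs because map over a DataFrame hands each element to the closure as an org.apache.spark.sql.Row, not as a tuple, so the pattern case (terms, termWeights) can never match. A minimal tuple-free sketch, assuming vocabArray: Array[String] is already in scope:

import scala.collection.mutable.WrappedArray

// collect() the rows to the driver, then read each field by name
// instead of destructuring the Row as a tuple
val topics = topicIndices.collect().map { row =>
  val terms   = row.getAs[WrappedArray[Int]]("termIndices")
  val weights = row.getAs[WrappedArray[Double]]("termWeights")
  terms.zip(weights).map { case (term, weight) => (vocabArray(term), weight) }
}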
Answer 0 (score: 2)
Problem solved. Here is the missing piece: once you have the DataFrame from describeTopics, the code below gets the corresponding words. (Note: this code is for the ml version of LDA.)
val topicDF = model.describeTopics(maxTermsPerTopic = 10)
// collect() brings the rows to the driver so the println output is visible here
for (row <- topicDF.collect()) {
  val topicNumber = row.get(0)
  val topicTerms  = row.get(1) // array of term indices (not used in this first pass)
  println("Topic: " + topicNumber)
}
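The vectorizer used below is assumed to be the CountVectorizerModel that produced the feature vectors fed into LDA; the vocabulary lives on that model rather than on the LDA model itself. A sketch of how it might have been fitted, with docs standing in for a DataFrame that has a tokenized words column:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

// hypothetical setup: fit a CountVectorizer on the tokenized corpus
val vectorizer: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(docs)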
import scala.collection.mutable.WrappedArray

val vocab = vectorizer.vocabulary
for (row <- topicDF.collect()) {
  val topicNumber = row.get(0)
  val terms: WrappedArray[Int] = row.get(1).asInstanceOf[WrappedArray[Int]]
  // look up each term index in the vocabulary (here just the first 4 terms per topic)
  for (termIdx <- 0 until 4) {
    println("Topic:" + topicNumber + " Word:" + vocab(terms(termIdx)))
  }
}
topicDF.printSchema
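For the describeTopics DataFrame this should print something along these lines (nullable flags may vary):

root
 |-- topic: integer (nullable = false)
 |-- termIndices: array (nullable = true)
 |    |-- element: integer (containsNull = false)
 |-- termWeights: array (nullable = true)
 |    |-- element: double (containsNull = false)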
import org.apache.spark.sql.Row
topicDF.collect().foreach { r =>
  r match {
    case row: Row => println("Topic:" + row)
    case unknown  => println("Something else: " + unknown)
  }
}
topicDF.collect().foreach { r =>
  println("Topic:" + r(0))
  val terms: WrappedArray[Int] = r(1).asInstanceOf[WrappedArray[Int]]
  terms.foreach { t =>
    println("Term:" + vocab(t))
  }
}
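Putting it all together, here is a sketch that recovers both the word and its weight for each topic, which is what the original question asked for (assuming the model and vectorizer from above):

import scala.collection.mutable.WrappedArray

val vocab = vectorizer.vocabulary
model.describeTopics(maxTermsPerTopic = 10).collect().foreach { row =>
  val topic   = row.getAs[Int]("topic")
  val terms   = row.getAs[WrappedArray[Int]]("termIndices")
  val weights = row.getAs[WrappedArray[Double]]("termWeights")
  println(s"Topic $topic:")
  terms.zip(weights).foreach { case (t, w) => println(f"  $w%.4f  ${vocab(t)}") }
}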