Spark - 'LDAModel' object has no attribute 'describeTopics'

Date: 2016-12-21 08:45:02

Tags: apache-spark pyspark apache-spark-mllib apache-spark-ml

I am currently using CDH Spark 1.5.0, Python 2.6.6, Hadoop 2.6.

I am trying to build an LDA model by following this link: Spark 1.5.0 - Latent Dirichlet allocation (LDA)

Quoting from the documentation:

  All of MLlib's LDA models support:

  • describeTopics: returns topics as arrays of most important terms and term weights
  • topicsMatrix: returns a vocabSize by k matrix where each column is a topic

I want to implement LDA using describeTopics.

Code (reproducible):

from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row

conf = SparkConf().setAppName("test").set("spark.executor.memory", "512m")
sc = SparkContext(conf = conf)
sc.setLogLevel('ERROR')
sqlContext = SQLContext(sc)


# Load and parse the data
data = sc.parallelize([[0,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,1,0,1],[0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,2,0,0,0,0,0,1,0,0,0,0],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0],[0,1,1,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],[0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,0,0,1,1,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0],[1,0,0,0,1,0,0,0,0,0,1,0,0,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,0,0,0,0,1,0]])
#data = sc.textFile("file://data.txt")
#parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))

# The rows above are already lists of counts, so no string parsing is needed
# (strip/split only applies when reading lines from a text file, as commented out above)
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
#topics = ldaModel.topicsMatrix()
topics = ldaModel.describeTopics(maxTermsPerTopic = 10)

for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))

But I get the following error:

AttributeError: 'LDAModel' object has no attribute 'describeTopics'

Does Spark not support describeTopics? Am I missing something here?

1 answer:

Answer 0 (score: 1):

This is the expected behavior. describeTopics in PySpark MLlib was only introduced in Spark 1.6:
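Until an upgrade to Spark 1.6+ is possible, similar output can be derived from topicsMatrix(), which Spark 1.5 does expose. Below is a sketch, not part of the Spark API: the hypothetical helper describe_topics takes the vocabSize x k topic matrix as a plain array and returns, for each topic, the indices of the top terms together with their normalized weights. In the question's code you would call it as describe_topics(np.asarray(ldaModel.topicsMatrix()), maxTermsPerTopic). A toy matrix stands in for a trained model here so the snippet runs on its own.

```python
import numpy as np

def describe_topics(topics_matrix, max_terms_per_topic=10):
    """Mimic describeTopics on a vocabSize x k matrix: for each topic,
    return (top term indices, weights normalized to sum to 1)."""
    matrix = np.asarray(topics_matrix, dtype=float)
    results = []
    for topic in range(matrix.shape[1]):
        col = matrix[:, topic]                       # term weights for this topic
        top = np.argsort(col)[::-1][:max_terms_per_topic]  # highest weights first
        results.append((top.tolist(), (col[top] / col.sum()).tolist()))
    return results

# Toy 4-term x 2-topic matrix standing in for ldaModel.topicsMatrix()
toy = [[5.0, 1.0],
       [1.0, 6.0],
       [3.0, 1.0],
       [1.0, 2.0]]
for terms, weights in describe_topics(toy, max_terms_per_topic=2):
    print(terms, [round(w, 2) for w in weights])
```

Note that topicsMatrix gives unnormalized topic-term counts, so the weights here are only proportional to what describeTopics would report, not guaranteed to match it exactly.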