Extract document-topic matrix from Pyspark LDA Model

Date: 2015-10-12 02:37:27

Tags: python apache-spark pyspark lda

I have successfully trained an LDA model in Spark via the Python API:

from pyspark.mllib.clustering import LDA
model = LDA.train(corpus, k=10)
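
For context, LDA.train expects the corpus as an RDD of (document id, term-count vector) pairs; a minimal sketch (the literal vectors here are made up for illustration):

from pyspark.mllib.linalg import Vectors

# two tiny documents over a 3-word vocabulary
corpus = sc.parallelize([
    [0, Vectors.dense([1.0, 2.0, 0.0])],
    [1, Vectors.dense([0.0, 1.0, 3.0])],
])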

This works completely fine, but I now need the document-topic matrix for the LDA model. As far as I can tell, all I can get is the word-topic matrix, via model.topicsMatrix().
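
For reference, that word-topic output is shaped like this (a sketch; the shape note is my reading of the mllib docs, not something from the original post):

# topicsMatrix() returns term weights with one column per topic:
# vocabSize x k, i.e. topics over words, not documents over topics
word_topic = model.topicsMatrix()
print(model.vocabSize())  # number of rows in word_topic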

Is there some way to get the document-topic matrix from the LDA model? If not, is there an alternative method in Spark (other than implementing LDA from scratch) to run an LDA model that will give me the result I need?

Edit:

After digging around a bit, I found the documentation for DistributedLDAModel in the Java API, which has a topicDistributions() method that I think is just what I need here (though I'm not 100% sure that the LDAModel in Pyspark is actually a DistributedLDAModel under the hood...).

In any case, I can indirectly call this method like so, without any overt failures:

In [127]: model.call('topicDistributions')
Out[127]: MapPartitionsRDD[3156] at mapPartitions at PythonMLLibAPI.scala:1480

But if I actually look at the results, all I get are strings telling me that the results are actually Scala tuples (I think):

In [128]: model.call('topicDistributions').take(5)
Out[128]:
[{u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'},
 {u'__class__': u'scala.Tuple2'}]

Maybe this is generally the right approach, but is there a way to get the actual results?

3 Answers:

Answer 0 (score: 6)

After extensive research, this is definitely not possible via the Python API on the current version of Spark (1.5.1). In Scala, however, it is fairly straightforward (given an RDD documents on which to train):

import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}

// first generate an RDD of documents...

val numTopics = 10
val lda = new LDA().setK(numTopics).setMaxIterations(10)
val ldaModel = lda.run(documents)

// then convert to a distributed LDA model
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]

Then getting the document-topic distributions is as simple as:

val docTopicDistribution = distLDAModel.topicDistributions

Answer 1 (score: 4)

The following expands on the above answer for PySpark and Spark 2.0.

I hope you'll forgive me for posting this as an answer rather than a comment, but I lack the rep at the moment.

I am assuming that you have a trained LDA model, ldaModel, built from a corpus.
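
A minimal sketch of what that training might look like with the ml API (the DataFrame name corpus, its features column, and the k=10 / maxIter=10 settings are my assumptions, not from the original answer):

from pyspark.ml.clustering import LDA

# corpus is assumed to be a DataFrame with a 'features' column
# holding one term-count vector per document
lda = LDA(k=10, maxIter=10)
ldaModel = lda.fit(corpus)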

To convert a document into a topic distribution, we create a DataFrame of the document id and a vector (sparse is better) of the words:

from pyspark.ml.linalg import Vectors

# placeholder names: len(words_in_our_corpus) is the vocabulary size,
# index_of_word / another_index map words to vector positions
documents = spark.createDataFrame([
    [1, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count})],
    [2, Vectors.sparse(len(words_in_our_corpus), {index_of_word: count, another_index: 1.0})],
], schema=["id", "features"])
transformed = ldaModel.transform(documents)
dist = transformed.take(1)
# dist[0]['topicDistribution'] is now a dense vector of our topics.
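
To get the full document-topic matrix rather than a single row, one option (a sketch, assuming the result fits in driver memory) is to collect it into a NumPy array:

import numpy as np

# one row per document, one column per topic
doc_topic = np.array([row['topicDistribution'].toArray()
                      for row in transformed.select('topicDistribution').collect()])
# doc_topic.shape == (number of documents, k)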

Answer 2 (score: 3)

As of Spark 2.0 you can use transform() as a method of pyspark.ml.clustering.DistributedLDAModel. I just tried this on the 20 newsgroups dataset from scikit-learn, and it works. Look at the returned topicDistribution, which is the distribution over topics for the document:

>>> test_results = ldaModel.transform(wordVecs)
>>> test_results.first()
Row(filename='/home/jovyan/work/data/20news_home/20news-bydate-test/rec.autos/103343', target=7, text='I am a little confused on all of the models of the 88-89 bonnevilles.\nI have heard of the LE SE LSE SSE SSEI. Could someone tell me the\ndifferences are far as features or performance. I am also curious to\nknow what the book value is for prefereably the 89 model. And how much\nless than book value can you usually get them for. In other words how\nmuch are they in demand this time of year. I have heard that the mid-spring\nearly summer is the best time to buy.', tokens=['little', 'confused', 'models', 'bonnevilles', 'someone', 'differences', 'features', 'performance', 'curious', 'prefereably', 'usually', 'demand', 'spring', 'summer'], vectors=SparseVector(10977, {28: 1.0, 29: 1.0, 152: 1.0, 301: 1.0, 496: 1.0, 552: 1.0, 571: 1.0, 839: 1.0, 1114: 1.0, 1281: 1.0, 1288: 1.0, 1624: 1.0}), topicDistribution=DenseVector([0.0462, 0.0538, 0.045, 0.0473, 0.0545, 0.0487, 0.0529, 0.0535, 0.0467, 0.0549, 0.051, 0.0466, 0.045, 0.0487, 0.0482, 0.0509, 0.054, 0.0472, 0.0547, 0.0501]))
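
If you want a single dominant topic per document rather than the full distribution, a small UDF can take the argmax (a sketch; the top_topic helper is my own, and the column names are taken from the Row above):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import numpy as np

# index of the highest-weight topic for each document
top_topic = udf(lambda v: int(np.argmax(v.toArray())), IntegerType())
test_results.withColumn('topTopic', top_topic('topicDistribution')) \
            .select('filename', 'topTopic').show(5)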