How to attach the original words to an LDA result set in a PySpark DataFrame

Asked: 2017-11-17 23:46:20

Tags: pyspark

I got the results of LDA in a PySpark DataFrame like this:

topicIndices.filter("topic > 3").show(10, truncate=True)
+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    4| [27, 56, 29, 46, 6]|[0.01826416604834...|
|    5| [63, 4, 36, 31, 21]|[0.01900143131755...|
|    6|[40, 60, 16, 36, 50]|[0.01915052744093...|
|    7|   [5, 59, 4, 8, 29]|[0.05513279495368...|
|    8| [52, 17, 10, 46, 2]|[0.01903217569516...|
|    9|     [0, 1, 3, 7, 6]|[0.13563252276342...|
+-----+--------------------+--------------------+
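For context, a DataFrame of this shape is what LDAModel.describeTopics() in pyspark.ml.clustering returns. Below is a minimal sketch of how such a result (and the vocabulary used later in this thread) might be produced; the tokenized input DataFrame, its tokens column, and all parameter values are assumptions for illustration, not from the original post:

from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

# Fit a CountVectorizer to get term-frequency vectors plus a vocabulary
cv = CountVectorizer(inputCol="tokens", outputCol="features", vocabSize=100)
cvModel = cv.fit(tokenized)          # `tokenized` has an array<string> column "tokens"
vectorized = cvModel.transform(tokenized)

# Fit LDA; describeTopics() yields rows of (topic, termIndices, termWeights)
lda = LDA(k=10, maxIter=20, featuresCol="features")
ldaModel = lda.fit(vectorized)
topicIndices = ldaModel.describeTopics(maxTermsPerTopic=5)

# The vocabulary maps term indices back to words; this list plays the
# role of `vocabList` in the rest of this thread
vocabList = cvModel.vocabulary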

I am trying to replace the term indices with the actual words so that I can inspect the topics. What I tried is:

topics = topicIndices \
    .rdd \
    .map(lambda x: vocabList[y] for y in x[1].zip(x[2]))

But I got the error:

NameError: name 'x' is not defined

What am I doing wrong here?

It is actually meant to be the Python version of this Scala code:

val topics = topicIndices.map { case (terms, termWeights) =>
                terms.map(vocabList(_)).zip(termWeights)
             }

from this Databricks post.

1 Answer:

Answer 0 (score: 0)

The generator expression inside your lambda should be enclosed in parentheses, i.e.:

topics = topicIndices \
    .rdd \
    .map(lambda x: ( vocabList[y] for y in x[1].zip(x[2]) ) )
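For the record, this is why the original version raises a NameError rather than a syntax error: without the parentheses, Python parses the whole argument to map as a generator expression whose element happens to be the lambda, and the generator's outermost iterable x[1].zip(x[2]) is evaluated immediately, in a scope where x is not defined. A minimal plain-Python illustration (no Spark required; the names are made up for demonstration):

# Without parentheses this parses as:  map( (lambda x: y * 2) for y in x )
# The generator's iterable `x` is looked up immediately and is undefined:
# map(lambda x: y * 2 for y in x)     # NameError: name 'x' is not defined

# With parentheses, the generator is the body of the lambda instead:
f = lambda x: (y * 2 for y in x)
print(list(f([1, 2, 3])))             # [2, 4, 6]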

UPDATE (after comments): You are apparently trying to use PySpark's zip, but that takes an RDD as its argument, not a list. My guess (since you have not provided an example of the result you want, let alone vocabList itself) is that you need the standard Python zip function, which is used differently:

# Pair each term's word with its weight, mirroring the Scala version
topics = topicIndices \
    .rdd \
    .map(lambda x: [(vocabList[i], w) for i, w in zip(x[1], x[2])])
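Putting it together, a short usage sketch that also keeps the topic id for printing (it assumes vocabList is a plain Python list of words, e.g. cvModel.vocabulary from a fitted CountVectorizerModel as sketched above):

# Keep the topic id alongside its (word, weight) pairs
topics = topicIndices \
    .rdd \
    .map(lambda x: (x[0], [(vocabList[i], w) for i, w in zip(x[1], x[2])]))

# Collect the small per-topic summaries back to the driver and print them
for topic_id, terms in topics.collect():
    print("Topic %d:" % topic_id)
    for word, weight in terms:
        print("  %-20s %.6f" % (word, weight))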