I got LDA results in a pyspark dataframe like this:
topicIndices.filter("topic > 3").show(10, truncate=True)
+-----+--------------------+--------------------+
|topic| termIndices| termWeights|
+-----+--------------------+--------------------+
| 4| [27, 56, 29, 46, 6]|[0.01826416604834...|
| 5| [63, 4, 36, 31, 21]|[0.01900143131755...|
| 6|[40, 60, 16, 36, 50]|[0.01915052744093...|
| 7| [5, 59, 4, 8, 29]|[0.05513279495368...|
| 8| [52, 17, 10, 46, 2]|[0.01903217569516...|
| 9| [0, 1, 3, 7, 6]|[0.13563252276342...|
+-----+--------------------+--------------------+
I'm trying to replace the term indices with the actual terms so I can inspect the topics. What I'm trying to do is:
topics = topicIndices \
.rdd \
.map(lambda x: vocabList[y] for y in x[1].zip(x[2]))
But I get this error:
NameError: name 'x' is not defined
What am I doing wrong here?
Actually, it is meant to be the Python version of this Scala code:
val topics = topicIndices.map { case (terms, termWeights) =>
terms.map(vocabList(_)).zip(termWeights)
}
Answer 0 (score: 0)
Your lambda expression should be enclosed in parentheses, i.e.:
topics = topicIndices \
.rdd \
.map(lambda x: ( vocabList[y] for y in x[1].zip(x[2]) ) )
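For context, the NameError comes from how Python parses the unparenthesized argument: the whole thing is read as a generator expression, so its iterable (which mentions x) is evaluated immediately, outside the lambda. A minimal stand-alone illustration, with a made-up vocabList and row and no Spark involved:
# Without parentheses the argument is itself a generator expression,
# so x[1] is evaluated right away, where no x exists:
#   .map(lambda x: vocabList[y] for y in x[1])   # NameError: name 'x' is not defined
# With parentheses the generator expression is the lambda's return value,
# so x[1] is only evaluated once the lambda is called with a row:
vocabList = ["apple", "banana", "cherry"]
row = (9, [0, 2], [0.7, 0.3])
f = lambda x: (vocabList[y] for y in x[1])
print(list(f(row)))  # ['apple', 'cherry']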
UPDATE (after the comments): You are apparently trying to use the PySpark zip, but that takes an RDD as its argument, not a list. My guess (since you have given neither an example of the result you want nor the vocabList itself) is that you need the standard Python zip function, which is used differently:
topics = topicIndices \
.rdd \
.map(lambda x: ( vocabList[y] for y in zip(x[1],x[2]) ) )
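If the goal is the (word, weight) pairs that the Scala snippet produces, note that each y coming out of zip above is an (index, weight) tuple, so it usually needs to be unpacked before indexing into vocabList. A sketch, assuming vocabList is an ordinary in-memory Python list of vocabulary terms:
# x[1] is the termIndices array and x[2] the termWeights array of each row;
# unpack each (index, weight) pair, look up the term, and keep its weight.
topics = topicIndices \
    .rdd \
    .map(lambda x: [(vocabList[idx], w) for idx, w in zip(x[1], x[2])])
# topics.collect() then yields, per topic, a list of (term, weight) pairs.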