I got LDA results in a pyspark dataframe like this:
topicIndices.filter("topic > 3").show(10, truncate=True)
+-----+--------------------+--------------------+
|topic| termIndices| termWeights|
+-----+--------------------+--------------------+
| 4| [27, 56, 29, 46, 6]|[0.01826416604834...|
| 5| [63, 4, 36, 31, 21]|[0.01900143131755...|
| 6|[40, 60, 16, 36, 50]|[0.01915052744093...|
| 7| [5, 59, 4, 8, 29]|[0.05513279495368...|
| 8| [52, 17, 10, 46, 2]|[0.01903217569516...|
| 9| [0, 1, 3, 7, 6]|[0.13563252276342...|
+-----+--------------------+--------------------+
I'm trying to replace the term indices with the actual terms so I can inspect the topics. What I'm trying to do is:
topics = topicIndices \
.rdd \
.map(lambda x: vocabList[y] for y in x[1].zip(x[2]))
But I get this error:
NameError: name 'x' is not defined
What am I doing wrong here?
Actually, it is meant to be the Python version of this Scala code:
val topics = topicIndices.map { case (terms, termWeights) =>
terms.map(vocabList(_)).zip(termWeights)
}
Answer 0 (score: 0)
Your lambda expression should be enclosed in parentheses, i.e.:
topics = topicIndices \
.rdd \
.map(lambda x: ( vocabList[y] for y in x[1].zip(x[2]) ) )
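For context, the NameError comes from how Python parses the unparenthesized argument: the whole thing is read as a generator expression, so its iterable (which mentions x) is evaluated immediately, outside the lambda. A minimal stand-alone illustration, with a made-up vocabList and row and no Spark involved:
# Without parentheses the argument is itself a generator expression,
# so x[1] is evaluated right away, where no x exists:
#   .map(lambda x: vocabList[y] for y in x[1])   # NameError: name 'x' is not defined
# With parentheses the generator expression is the lambda's return value,
# so x[1] is only evaluated once the lambda is called with a row:
vocabList = ["apple", "banana", "cherry"]
row = (9, [0, 2], [0.7, 0.3])
f = lambda x: (vocabList[y] for y in x[1])
print(list(f(row)))  # ['apple', 'cherry']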
UPDATE (after the comments): You are apparently trying to use the PySpark zip, but that takes an RDD as its argument, not a list. My guess (since you have given neither an example of the result you want nor the vocabList itself) is that you need the standard Python zip function, which is used differently:
topics = topicIndices \
.rdd \
.map(lambda x: ( vocabList[y] for y in zip(x[1],x[2]) ) )
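If the goal is the (word, weight) pairs that the Scala snippet produces, note that each y coming out of zip above is an (index, weight) tuple, so it usually needs to be unpacked before indexing into vocabList. A sketch, assuming vocabList is an ordinary in-memory Python list of vocabulary terms:
# x[1] is the termIndices array and x[2] the termWeights array of each row;
# unpack each (index, weight) pair, look up the term, and keep its weight.
topics = topicIndices \
    .rdd \
    .map(lambda x: [(vocabList[idx], w) for idx, w in zip(x[1], x[2])])
# topics.collect() then yields, per topic, a list of (term, weight) pairs.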