I have a PySpark DataFrame created by pyspark.ml.clustering.LDA. The topicDistribution column is a vector of n doubles.
+--------------------+
| topicDistribution|
+--------------------+
|[0.93673575849807...|
|[0.31615978901762...|
|[0.33657712774309...|
|[0.30523697192979...|
+--------------------+
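I want to create a separate (non-vector) double column for each element of the vector. Ultimately, I am trying to "unpack" the vector so that I can write the data to a CSV file.
I have tried several approaches.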
Approach 1 is to index directly into the column:
for i in range(3):
    df = df.withColumn("Col-" + str(i), df['topicDistribution'][i])
But this produces the error
AnalysisException: u"Can't extract value from topicDistribution#858;"
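Approach 2 tries a UDF, but as you can see, what comes back is not a value from the vector but a "pickle" object, and I can't figure out what to do with that.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
getTyp = udf(lambda arr: getType(arr,1), StringType())
for i in range(3):
    df = df.withColumn("Col-" + str(i), getTyp(df['topicDistribution']))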
This returns
+--------------------+--------------------+--------------------+--------------------+
| topicDistribution| Col-0| Col-1| Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673577353151...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.31615869274437...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.33657583318666...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.30523585516934...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
+--------------------+--------------------+--------------------+--------------------+
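Approach 3 uses VectorSlicer, which comes close, but the resulting columns are still vectors.
from pyspark.ml.feature import VectorSlicer
for i in range(3):
    slicer = VectorSlicer(inputCol="topicDistribution", outputCol="Col-" + str(i), indices=[i])
    df = slicer.transform(df)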
This produces the following. Note that each column is still a vector (see the surrounding [ ]).
+--------------------+--------------------+--------------------+--------------------+
| topicDistribution| Col-0| Col-1| Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673576108710...|[0.9367357610871071]|[0.03151327102122...|[0.03175096789167...|
|[0.31615848955402...|[0.31615848955402...|[0.3289336386324864]|[0.35490787181348...|
|[0.33657818512851...|[0.3365781851285112]|[0.32473902350327...|[0.3386827913682095]|
|[0.30523627602677...|[0.30523627602677...|[0.3426806504112193]|[0.3520830735620017]|
+--------------------+--------------------+--------------------+--------------------+
There must be a simple solution, but I am stumped.
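For reference, here is a minimal sketch of one direction that might work. It is not a confirmed solution. It assumes the UDF actually receives a vector object exposing toArray() (as DenseVector and SparseVector do), and that returning plain Python floats declared as ArrayType(DoubleType()) avoids the pickled output seen in approach 2. The names vector_to_floats, unpack_vector_column, and the temporary column "_arr" are hypothetical and chosen only for illustration.

```python
def vector_to_floats(vec):
    """Convert a vector-like object (anything exposing toArray()) to Python floats."""
    return [float(x) for x in vec.toArray()]


def unpack_vector_column(df, vector_col, n, prefix="Col-"):
    """Add n double columns, one per vector element (hypothetical helper)."""
    # Imported inside the function so vector_to_floats can be used without Spark.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    # Declare the return type as array<double> and return plain floats,
    # so Spark does not have to guess how to serialize the result.
    to_array = udf(vector_to_floats, ArrayType(DoubleType()))
    out = df.withColumn("_arr", to_array(col(vector_col)))

    # Unlike a vector column, an array column supports [i] indexing.
    for i in range(n):
        out = out.withColumn(prefix + str(i), out["_arr"][i])

    # Drop the temporary array column; the original vector column is kept.
    return out.drop("_arr")
```

With the DataFrame above, this would be called as unpack_vector_column(df, "topicDistribution", 3); the vector column would presumably still need to be dropped before writing CSV. On Spark 3.0 or later, pyspark.ml.functions.vector_to_array may make the UDF unnecessary.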