Extracting columns from a vector column in a PySpark DataFrame

Time: 2018-08-15 10:05:19

Tags: python apache-spark vector pyspark

I have a PySpark DataFrame created by pyspark.ml.clustering.LDA. The topicDistribution column is a vector of n doubles.

+--------------------+
|   topicDistribution|
+--------------------+
|[0.93673575849807...|
|[0.31615978901762...|
|[0.33657712774309...|
|[0.30523697192979...|
+--------------------+

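I want to create a separate (non-vector) double column for each element of the vector. Ultimately, I am trying to "unpack" the vector so that I can write the data to a CSV file.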

I have tried several approaches.
Approach 1 is to index directly into the column:

for i in range(3): 
    df = df.withColumn("Col-" + str(i), df['topicDistribution'][i])

But this produces the error

AnalysisException: u"Can't extract value from topicDistribution#858;"
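The error in approach 1 arises because topicDistribution is a vector (user-defined) type rather than an array, so item access with [i] is not defined for it. A minimal sketch of one workaround, assuming Spark 3.0 or later (newer than this question), is to convert the column with pyspark.ml.functions.vector_to_array and then index the resulting array. The names unpack_with_vector_to_array and output_names, and their parameters, are hypothetical and chosen only for illustration.

```python
def output_names(n, prefix="Col-"):
    """Build the target column names, e.g. Col-0, Col-1, ... (hypothetical helper)."""
    return [prefix + str(i) for i in range(n)]


def unpack_with_vector_to_array(df, vector_col="topicDistribution", n=3, prefix="Col-"):
    """Append n double columns holding the first n elements of vector_col."""
    # Spark imports are kept inside the function so output_names can be used
    # without a Spark installation. vector_to_array assumes Spark >= 3.0.
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # array<double> column, which (unlike the vector type) supports [i] indexing
    arr = vector_to_array(col(vector_col))
    new_cols = [arr[i].alias(name) for i, name in enumerate(output_names(n, prefix))]
    return df.select("*", *new_cols)
```

Since the CSV writer does not accept vector columns, the original column would presumably need to be dropped before writing, for example unpack_with_vector_to_array(df).drop("topicDistribution").write.csv(path, header=True), where path is a placeholder for the output location.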

Approach 2 tries to use a UDF, but as you can see, my UDF is not passed a vector but a "pickle" object, and I don't know what to do with it.

getTyp = udf(lambda arr: getType(arr,1), StringType())
for i in range(3): 
    df = df.withColumn("Col-" + str(i), getTyp(df['topicDistribution']))

This returns

+--------------------+--------------------+--------------------+--------------------+
|   topicDistribution|               Col-0|               Col-1|               Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673577353151...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.31615869274437...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.33657583318666...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
|[0.30523585516934...|net.razorvine.pic...|net.razorvine.pic...|net.razorvine.pic...|
+--------------------+--------------------+--------------------+--------------------+
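For approach 2, a Python UDF does receive the pyspark.ml.linalg vector object itself; the net.razorvine.pic... strings most likely come from returning a non-string value under a StringType() return type. A minimal sketch that should also work on Spark 2.x is to have the UDF return the vector as a list of Python floats, declare the return type as ArrayType(DoubleType()), and index the resulting array column. The names vector_to_list and unpack_with_udf and the temporary column _arr are hypothetical.

```python
def vector_to_list(vec):
    """Turn a pyspark.ml.linalg vector (dense or sparse) into a list of Python floats."""
    # toArray() yields a NumPy array; DoubleType expects plain Python floats,
    # so each element is converted explicitly.
    return [float(x) for x in vec.toArray()]


def unpack_with_udf(df, vector_col="topicDistribution", n=3, prefix="Col-"):
    """Append n double columns holding the first n elements of vector_col."""
    # Spark imports are kept inside the function so vector_to_list can be
    # used and tested without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    to_list = udf(vector_to_list, ArrayType(DoubleType()))
    # Materialise the array once in a temporary column, then pick elements from it.
    with_arr = df.withColumn("_arr", to_list(col(vector_col)))
    new_cols = [col("_arr")[i].alias(prefix + str(i)) for i in range(n)]
    return with_arr.select("*", *new_cols).drop("_arr")
```

Called as unpack_with_udf(df, n=3), this is intended to yield Col-0 to Col-2 as plain double columns next to the original vector column.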

Approach 3, using VectorSlicer, comes close, but the resulting columns are still vectors.

for i in range(3): 
    slicer = VectorSlicer(inputCol="topicDistribution", outputCol="Col-" + str(i), indices=[i])
    df = slicer.transform(df)

This produces the following. Note that each column is still a vector (see the [ ] around each value).

+--------------------+--------------------+--------------------+--------------------+
|   topicDistribution|               Col-0|               Col-1|               Col-2|
+--------------------+--------------------+--------------------+--------------------+
|[0.93673576108710...|[0.9367357610871071]|[0.03151327102122...|[0.03175096789167...|
|[0.31615848955402...|[0.31615848955402...|[0.3289336386324864]|[0.35490787181348...|
|[0.33657818512851...|[0.3365781851285112]|[0.32473902350327...|[0.3386827913682095]|
|[0.30523627602677...|[0.30523627602677...|[0.3426806504112193]|[0.3520830735620017]|
+--------------------+--------------------+--------------------+--------------------+

There must be a simple solution, but I am stumped.

0 answers:

No answers