How do I call the squared-distance function on two columns in PySpark?

Asked: 2018-12-13 06:01:26

Tags: python apache-spark pyspark apache-spark-sql partition

I have a dataframe that looks like this:

+--------------------+--------------------+-------+
|            feature1|            feature2| domain|
+--------------------+--------------------+-------+
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain2|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain2|
|[2.23668528E8, 1....|[2.23668528E8, 1....|domain1|
+--------------------+--------------------+-------+

The schema of the dataframe is:

|-- domain: string (nullable = true)
|-- feature1: vector (nullable = true)
|-- feature2: vector (nullable = true)
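
For a self-contained reproduction, a dataframe with this shape can be built roughly as follows (the vector values here are made up; only the column types match the real data):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows with the same column types as the real data
all_pairs_df = spark.createDataFrame(
    [
        (Vectors.dense([2.23668528e8, 1.0]), Vectors.dense([2.23668528e8, 1.5]), "domain1"),
        (Vectors.dense([2.23668528e8, 2.0]), Vectors.dense([2.23668528e8, 2.5]), "domain2"),
    ],
    ["feature1", "feature2", "domain"],
)
all_pairs_df.printSchema()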

The columns feature1 and feature2 are DenseVectors. I want to compute the squared distance between them, and I tried this:

all_pairs_df.withColumn(
    "distance",
    Vectors.squared_distance(all_pairs_df.feature2, all_pairs_df.feature1)
).show()

But I get an error. Any ideas on how to do this in PySpark without resorting to a UDF, which is ultimately evaluated through BatchEvalPython?
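
For reference, the UDF version I am trying to avoid would look roughly like this (a minimal sketch, assuming the vectors come from pyspark.ml.linalg):

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

# Works, but is evaluated via BatchEvalPython, which I want to avoid
squared_distance_udf = F.udf(
    lambda v1, v2: float(Vectors.squared_distance(v1, v2)), DoubleType()
)

all_pairs_df.withColumn(
    "distance", squared_distance_udf("feature1", "feature2")
).show()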

0 Answers:

There are no answers yet.