I have a DataFrame that looks like this:
+--------------------+--------------------+-------------+
| feature1| feature2| domain |
+--------------------+--------------------+-------------+
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain1 |
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain2 |
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain1 |
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain2 |
|[2.23668528E8, 1....|[2.23668528E8, 1....| domain1 |
The schema of the DataFrame is:
|-- domain: string (nullable = true)
|-- feature1: vector (nullable = true)
|-- feature2: vector (nullable = true)
The columns feature1 and feature2 are of type DenseVector. I want to compute the squared distance between them. I tried this:
from pyspark.ml.linalg import Vectors

all_pairs_df.withColumn(
    "distance",
    Vectors.squared_distance(all_pairs_df.feature2, all_pairs_df.feature1)
).show()
But I get an error. Any idea how to do this in PySpark without using a UDF, which would end up calling BatchEvalPython?
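To be explicit about the result I'm after: per row, I just want the plain squared Euclidean distance between the two vectors. A minimal illustration with numpy on made-up vectors (not my actual data):

```python
import numpy as np

# Hypothetical stand-ins for one row's feature1/feature2 vectors.
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([4.0, 6.0, 8.0])

# Squared Euclidean distance: sum of squared component differences.
sq_dist = float(np.sum((v1 - v2) ** 2))
print(sq_dist)  # 9 + 16 + 25 = 50.0
```

This is what `Vectors.squared_distance` computes on two local vectors; my problem is doing the same thing column-wise over the whole DataFrame without a Python UDF.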