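I am trying to compute the Euclidean distance between corresponding pairs of feature vectors. I tried a regular udf first and it worked fine. I would now like to use a pandas_udf to speed things up.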
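from pyspark.sql import functions as F, types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType, array

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    from scipy.spatial import distance
    dist = float(distance.euclidean(feature1, feature2))
    return float(dist)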
This is what the data looks like. The feature1 and feature2 columns each hold a list, and the two lists have the same dimension.
all_pairs_remove_same_pair_df.select("feature1", "feature2").show()
+--------------------+--------------------+
|            feature1|            feature2|
+--------------------+--------------------+
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
all_pairs_remove_same_pair_df.withColumn("distance", calculate_euclidean_distance(array(F.col("feature1"), F.col("feature2"))))
This is the error I get:
TypeError: calculate_euclidean_distance() missing 1 required positional argument: 'feature2'
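
For what it's worth, the message seems to follow from the call itself: array(F.col("feature1"), F.col("feature2")) combines both columns into a single column, so the UDF receives one argument instead of two. Note also that a scalar pandas_udf is called with whole pandas Series (one batch of rows at a time) and has to return a Series of the same length, so the body needs to work row by row rather than on a single pair. Below is a minimal sketch of such a body, not a definitive fix. The name euclidean_distance_batch is hypothetical, numpy is swapped in for scipy, and the Spark wrapping is shown only in comments because it assumes pyspark and pyarrow are installed.

```python
import numpy as np
import pandas as pd


def euclidean_distance_batch(feature1: pd.Series, feature2: pd.Series) -> pd.Series:
    """Row-wise Euclidean distance between two Series of equal-length lists."""
    distances = [
        # Each a, b is one row's list of numbers from feature1 / feature2.
        float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))
        for a, b in zip(feature1, feature2)
    ]
    # One output value per input row, as a scalar pandas UDF requires.
    return pd.Series(distances, index=feature1.index, dtype="float64")


# Possible Spark wrapping (assumption: pyspark and pyarrow available; not run here).
# DoubleType is chosen to match the float64 values produced above.
#   from pyspark.sql import functions as F, types as T
#   calculate_euclidean_distance = F.pandas_udf(euclidean_distance_batch, T.DoubleType())
#   df = all_pairs_remove_same_pair_df.withColumn(
#       "distance",
#       calculate_euclidean_distance(F.col("feature1"), F.col("feature2")),
#   )
```

The key difference from the call in the question is that the two columns are passed as two separate arguments, without array(...).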