Passing two lists to a pandas_udf in PySpark?

Asked: 2018-12-09 07:10:35

Tags: python apache-spark pyspark user-defined-functions

I am trying to compute the Euclidean distance between corresponding pairs of features. A regular udf works fine for this, and I would like to try a pandas_udf to speed it up.
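For context, the plain-udf approach was presumably along the lines of the following sketch (the original udf is not shown in the question, so the function name and decorator form here are assumed):

@F.udf(T.FloatType())
def euclidean_distance_plain(feature1, feature2):
    # Plain udf: feature1 and feature2 are the two arrays from a single row.
    from scipy.spatial import distance
    return float(distance.euclidean(feature1, feature2))

all_pairs_remove_same_pair_df.withColumn(
    "distance",
    euclidean_distance_plain(F.col("feature1"), F.col("feature2")),
)

The pandas_udf version I am trying is: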

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    from scipy.spatial import distance
    dist = float(distance.euclidean(feature1, feature2))
    return float(dist)

Here is what the data looks like. The feature1 and feature2 columns are two lists of the same dimension.

all_pairs_remove_same_pair_df.select("feature1", "feature2").show()

+--------------------+--------------------+
|            feature1|            feature2|
+--------------------+--------------------+
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
|[2.23668528E8, 1....|[2.23668528E8, 1....|
+--------------------+--------------------+

all_pairs_remove_same_pair_df.withColumn("distance", calculate_euclidean_distance(array(F.col("feature1"), F.col("feature2"))))

Here is the error I get:

TypeError: calculate_euclidean_distance() missing 1 required positional argument: 'feature2'
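The error indicates the udf is receiving only one argument: wrapping the two columns in array(...) produces a single array column, so only the first parameter is bound. For reference, a minimal sketch of a version that passes the two columns as separate arguments and follows the SCALAR pandas_udf contract (operate on pandas Series, return a Series of the same length); this is an assumed fix, not something tested against the data above:

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.spatial import distance

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    # feature1 and feature2 arrive as pandas Series; each element is one row's array.
    # A SCALAR pandas_udf must return a Series of the same length, not a single float.
    return pd.Series(
        [float(distance.euclidean(f1, f2)) for f1, f2 in zip(feature1, feature2)]
    )

# Pass the two columns as separate arguments so they bind to the two parameters,
# instead of wrapping them in array(...), which yields a single column.
all_pairs_remove_same_pair_df.withColumn(
    "distance",
    calculate_euclidean_distance(F.col("feature1"), F.col("feature2")),
)

Note that this sketch still loops over rows inside the Series, so the gain over a plain udf mostly comes from batched Arrow serialization rather than vectorized arithmetic; a fully vectorized formulation would stack the arrays with numpy, but the above keeps the structure of the original udf.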

0 Answers:

There are no answers yet.