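I am trying to calculate the Euclidean distance between two columns, each of which holds a list of floats. I tried computing it with pandas_udf in two ways: one with the imports outside the function, and one with the imports inside the function.
First method: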
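# Imports at module level, outside the UDF
import pandas as pd
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.spatial import distance
@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]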
Second method:
@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    # Imports inside the UDF body
    from scipy.spatial import distance
    import pandas as pd
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]
Both of them work on my local Spark setup. What is the difference between these two approaches?
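For reference, the value expected for each row can be checked without Spark. This is only a minimal sketch in plain Python of the per-row quantity both versions should return; the sample rows and the helper name `expected_distance` are made up for illustration and are not part of the actual job.

```python
import math


def expected_distance(feature1, feature2):
    """Euclidean distance between two equal-length lists of floats."""
    if len(feature1) != len(feature2):
        raise ValueError("feature lists must have the same length")
    # Square root of the sum of squared element-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(feature1, feature2)))


# Hypothetical sample rows standing in for the two list columns.
rows = [
    ([0.0, 0.0], [3.0, 4.0]),
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
]

distances = [expected_distance(f1, f2) for f1, f2 in rows]
print(distances)
```

Either UDF should produce the same numbers for the same rows, up to the float32 rounding implied by T.FloatType().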