What is the difference between importing modules inside a UDF and importing them outside the UDF in PySpark?

Asked: 2018-12-09 16:07:30

Tags: python apache-spark pyspark user-defined-functions

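I am trying to compute the Euclidean distance between two columns, each of which holds lists of floats. I tried doing this with pandas_udf in two ways: one with the imports inside the function, and one with the imports outside it. First approach: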

import pandas as pd
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.spatial import distance

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]
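
For reference, the per-row computation that the UDF performs can be sketched in plain Python, without Spark or pandas. This is only an illustration under assumptions: the name `rowwise_euclidean` and the sample data are hypothetical and are not part of the question's code.

```python
import math


def rowwise_euclidean(feature1, feature2):
    """Return one Euclidean distance per pair of equal-length float lists.

    Hypothetical helper: it mirrors what the UDF computes row by row,
    but it is not the code from the question.
    """
    distances = []
    for a, b in zip(feature1, feature2):
        if len(a) != len(b):
            raise ValueError("vectors in the same row must have equal length")
        # Square root of the sum of squared coordinate differences.
        distances.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))
    return distances


# Made-up sample data: two "columns", each holding lists of floats.
col1 = [[0.0, 0.0], [1.0, 2.0]]
col2 = [[3.0, 4.0], [1.0, 2.0]]
result = rowwise_euclidean(col1, col2)
print(result)
```

Each output element corresponds to one row, which matches the shape a scalar pandas UDF is expected to return (one value per input row).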

Second approach:

from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    # scipy and pandas are imported inside the function body
    from scipy.spatial import distance
    import pandas as pd
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]

Both of them work in my local Spark setup. What is the difference between these two approaches?
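
At the plain-Python level, the two styles differ in where the imported name is bound. The sketch below illustrates only that language-level difference, using the standard library's `math` module as a stand-in for scipy. The function names are hypothetical, and how Spark ships either function to its executors is not covered here.

```python
import math  # module-level import: `math` becomes a global name of this module


def distance_module_import(a, b):
    # `math` is looked up in the module's globals each time this runs.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_local_import(a, b):
    # The import statement runs on every call. After the first import the
    # module is cached in sys.modules, so later calls only rebind the name.
    import math
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Both variants compute the same value.
print(distance_module_import([0.0, 0.0], [3.0, 4.0]))
print(distance_local_import([0.0, 0.0], [3.0, 4.0]))

# In the local-import variant, `math` is a local variable of the function;
# in the module-level variant it is not.
print("math" in distance_local_import.__code__.co_varnames)
print("math" in distance_module_import.__code__.co_varnames)
```

With a function-level import, the function does not rely on the enclosing module having imported the library first. With a module-level import, the name must already exist in the function's global namespace when the function is called.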

0 answers:

No answers