Applying a UDF over a Window function in PySpark

Asked: 2018-01-17 00:48:46

Tags: apache-spark pyspark spark-dataframe

When I try to apply a UDF over a Window function in PySpark, I get the error: "AnalysisException: expression is not supported within a window function".

For example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.sql.window import Window

sp = SparkSession(sc)
data = [(1,5,6,), (1,6,2,), (1,7,4,), (1,5,3,), (1,6,1,), (1,7,5,),
        (2,2,5,), (2,3,3,), (2,4,2,), (2,2,1,), (2,3,6,), (2,4,4,)]
df = sp.createDataFrame(data, ["acc", "val", "date"])

w = (Window.partitionBy(df.acc).orderBy(df.date).rangeBetween(-3, 0))

def perform_some_operation(x):
    return sum(x)
perform_some_operation_udf = udf(perform_some_operation, DoubleType())

df = df.withColumn('udf_of_val_over_w', perform_some_operation_udf(df['val']).over(w))
This throws the AnalysisException because of the UDF.

But when I replace the UDF with a built-in function from pyspark.sql.functions, it works:

from pyspark.sql import functions as sqlf
df = df.withColumn('udf_of_val_over_w', sqlf.avg(df['val']).over(w))

Any ideas on how to use a UDF in this situation? Thanks in advance!

0 Answers:

There are no answers yet.