When I try to apply a UDF over a Window function in PySpark, I get the error: "AnalysisException: expression ... is not supported within a window function"
For example:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

sp = SparkSession(sc)
data = [(1, 5, 6), (1, 6, 2), (1, 7, 4), (1, 5, 3), (1, 6, 1), (1, 7, 5),
        (2, 2, 5), (2, 3, 3), (2, 4, 2), (2, 2, 1), (2, 3, 6), (2, 4, 4)]
df = sp.createDataFrame(data, ["acc", "val", "date"])

w = Window.partitionBy(df.acc).orderBy(df.date).rangeBetween(-3, 0)

def perform_some_operation(x):
    return sum(x)

perform_some_operation_udf = udf(perform_some_operation, DoubleType())
df = df.withColumn('udf_of_val_over_w', perform_some_operation_udf(df['val']).over(w))
This throws the AnalysisException because of the UDF. But when I replace the UDF with a built-in function from pyspark.sql.functions, it works:
from pyspark.sql import functions as sqlf
df = df.withColumn('udf_of_val_over_w', sqlf.avg(df['val']).over(w))
Any ideas on how to use a UDF in this situation? Thanks in advance!