I can't call a UDF that uses window functions.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
mst = spark.createDataFrame([(1, "v1"), (2, "v1"), (3, "v1"), (21, "v2"), (22, "v2"), (31, "v3")], ["mst_id", "mst_val"])
ref = spark.createDataFrame([(91, "v1"), (92, "v2"), (93, "v3")], ["ref_id", "ref_val"])
I defined a simple function:
def fnc1(val):
    w = Window.partitionBy("mst_val").orderBy(F.col("mst_id").asc())
    mtch = (mst.withColumn("rank", F.row_number().over(w))
               .filter((F.col("rank") == 1) & (F.col("mst_val") == F.lit(val)))
               .collect())
    return mtch[0]["mst_id"] if len(mtch) else -1
fnc1("v3")
returns 31.
Then I defined a simple UDF:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import *
udf1 = udf(lambda r: fnc1(r), IntegerType())
Calling the UDF fails:
ref.withColumn("abc",udf1(col("ref_val")))
which raises: py4j.Py4JException: Method __getnewargs__([]) does not exist
Can anyone help? Thanks!