Task serialization error when using a UDF in Spark

Date: 2018-09-08 11:12:20

Tags: scala apache-spark apache-spark-sql user-defined-functions

When I create the UDF shown below, I get a task serialization error. The error only occurs when I run the code with spark-submit in cluster deploy mode; it works fine in spark-shell.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray

// Returns the most frequent non-null URL in the array, or null if all entries are null
def mfnURL(arr: WrappedArray[String]): String = {
  val filterArr = arr.filterNot(_ == null)
  if (filterArr.length == 0)
    return null
  else {
    filterArr.groupBy(identity).maxBy(_._2.size)._1
  }
}

val mfnURLUDF = udf(mfnURL _)

// Count url occurrences per (nodeId, url, typology), then aggregate per (nodeId, typology)
def windowSpec = Window.partitionBy("nodeId", "url", "typology")
val result = df.withColumn("count", count("url").over(windowSpec))
  .orderBy($"count".desc)
  .groupBy("nodeId", "typology")
  .agg(
    first("url"),
    mfnURLUDF(collect_list("source_url")),
    min("minTimestamp"),
    max("maxTimestamp")
  )

I tried adding spark.udf.register("mfnURLUDF", mfnURLUDF), but it did not solve the problem.
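
For what it's worth, spark.udf.register only exposes the UDF under a name for SQL / expr-style calls; it does not change how the underlying function is serialized, which is presumably why registering it did not help. A minimal sketch of that pattern, assuming a SparkSession named spark and the same df and mfnURLUDF as above (the column alias is just illustrative):

// Registration only matters for name-based invocation (SQL / expr);
// the function itself is still serialized the same way
spark.udf.register("mfnURLUDF", mfnURLUDF)
val mostFrequent = df
  .groupBy("nodeId", "typology")
  .agg(expr("mfnURLUDF(collect_list(source_url))").as("most_frequent_source_url"))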

1 Answer:

Answer 0 (score: 2):

You could also try creating the udf this way:

// A val holding an anonymous function; note there is no `return` here,
// since `return` is not legal inside a lambda
val mfnURL = udf { arr: WrappedArray[String] =>
  val filterArr = arr.filterNot(_ == null)
  if (filterArr.isEmpty) null
  else filterArr.groupBy(identity).maxBy(_._2.size)._1
}
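
The difference is that udf(mfnURL _) eta-expands a method, and the resulting function object can capture the enclosing class; if that class is not serializable, the task fails when submitted to a cluster even though it works in the REPL. A val holding an anonymous function carries no such outer reference. As a sketch, the new mfnURL can then replace mfnURLUDF in the original aggregation (assuming the same df and windowSpec as in the question):

val result = df.withColumn("count", count("url").over(windowSpec))
  .orderBy($"count".desc)
  .groupBy("nodeId", "typology")
  .agg(
    first("url"),
    mfnURL(collect_list("source_url")),  // val-based udf instead of udf(mfnURL _)
    min("minTimestamp"),
    max("maxTimestamp")
  )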