The following code works fine on Spark versions (2.*) prior to 2.4.0:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MyApp extends App {
  val spark = SparkSession.builder
    .appName("udf check").master("local[*]").getOrCreate
  import spark.implicits._

  val initDf = spark.read
    .option("delimiter", "|")
    .csv("input.txt")
    .select($"_c0".alias("person"), split($"_c1", ",").alias("friends"))

  // udfs
  val reverse_friends_name = udf((friends: Seq[String]) => friends.map(_.reverse))
  val flatten = udf((listOfFriends: Seq[Seq[String]]) => listOfFriends.flatten.toList)

  initDf.groupBy("person").agg(reverse_friends_name(flatten(collect_set("friends")))).show
}
Here is the input:
sam|jenny,miller
miller|joe
sam|carl
joe|frank
Output produced:
+------+------------------------------------+
|person|UDF(UDF(collect_set(friends, 0, 0)))|
+------+------------------------------------+
|miller|                               [eoj]|
|   joe|                             [knarf]|
|   sam|                [ynnej, rellim, l...|
+------+------------------------------------+
With Spark 2.4.0, however, the following code breaks:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MyApp extends App {
  val spark = SparkSession.builder
    .appName("udf check").master("local[*]").getOrCreate
  import spark.implicits._

  val initDf = spark.read
    .option("delimiter", "|")
    .csv("input.txt")
    .select($"_c0".alias("person"), split($"_c1", ",").alias("friends"))

  // udf (the flatten udf is gone; flatten below now resolves to the
  // built-in org.apache.spark.sql.functions.flatten added in 2.4.0)
  val reverse_friends_name = udf((friends: Seq[String]) => friends.map(_.reverse))

  initDf.groupBy("person").agg(reverse_friends_name(flatten(collect_set("friends")))).show
}
It produces the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
- object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
- element of array (index: 2)
- array (class [Ljava.lang.Object;, size 3)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.catalyst.expressions.ScalaUDF, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$2:(Lscala/Function1;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, numCaptured=3])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$1841/822958001, org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$1841/822958001@e097c13)
- field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF(flatten(collect_set(friends#15, 0, 0)#20)))
- field (class: org.apache.spark.sql.catalyst.expressions.Alias, name: child, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Alias, UDF(flatten(collect_set(friends#15, 0, 0)#20)) AS UDF(flatten(collect_set(friends, 0, 0)))#21)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 2)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(person#14, UDF(flatten(collect_set(friends#15, 0, 0)#20)) AS UDF(flatten(collect_set(friends, 0, 0)))#21))
- field (class: org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: resultExpressions, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, ObjectHashAggregate(keys=[person#14], functions=[collect_set(friends#15, 0, 0)], output=[person#14, UDF(flatten(collect_set(friends, 0, 0)))#21])
I can't find much documentation on this. Was this removed in favour of the newly added collection functions?
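For reference, with the collection and higher-order functions that 2.4 adds (flatten, transform, reverse), the same aggregation can also be expressed without any Scala UDF; a rough, untested sketch, assuming the same initDf as above:

import org.apache.spark.sql.functions.expr

// flatten and transform are built-ins from 2.4, and reverse reverses each string,
// so no ScalaUDF needs to be serialized at all
initDf.groupBy("person")
  .agg(expr("transform(flatten(collect_set(friends)), f -> reverse(f))").alias("reversed_friends"))
  .show(false)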
Answer 0 (score: 0)
If your code throws the above error, change the Scala version of your Spark libraries from 2.12 to 2.11 (that is, use the _2.11 Spark artifacts) and everything will run.
In my case I was on a 3.x version of Spark, where Scala 2.12 is the default; I switched to Spark 2.4 built against Scala 2.11 and everything worked fine.
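For an sbt build, that amounts to pinning Scala 2.11 and pulling the _2.11 Spark artifacts, roughly like this (a sketch only; adjust the Spark 2.4.x patch version and the rest of your dependencies to your setup):

// build.sbt (sketch): build against a Scala 2.11 version of Spark instead of 2.12
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // %% appends the Scala binary suffix, so this resolves to spark-sql_2.11
  "org.apache.spark" %% "spark-sql" % "2.4.0"
)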