DataFrame with a UDF throws a Task not serializable exception

Date: 2019-09-01 05:00:54

Tags: scala apache-spark

Calling the show() method on a DataFrame throws a Task not serializable exception.

I tried making the object extend Serializable, but the error persists.

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import scala.io.Source.fromFile

object App extends Serializable {
  def main(args: Array[String]): Unit = {

    Logger.getLogger("org.apache").setLevel(Level.WARN)

    val spark = SparkSession.builder()
      .appName("LearningSpark")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    val inputPath = "./src/resources/2015-03-01-0.json"
    val ghLog = spark.read.json(inputPath)
    val pushes = ghLog.filter("type = 'PushEvent'")
    val grouped = pushes.groupBy("actor.login").count
    val ordered = grouped.orderBy(grouped("count").desc)
    ordered.show(5)

    // Read the employee list on the driver and broadcast it to the executors.
    val empPath = "./src/resources/ghEmployees.txt"
    val employees = Set() ++ (
      for {
        line <- fromFile(empPath).getLines
      } yield line.trim)
    val bcEmployees = sc.broadcast(employees)

    import spark.implicits._
    // Register a UDF that checks membership in the broadcast set,
    // then filter with it; this is the call that triggers the exception.
    val isEmp = (user: String) => bcEmployees.value.contains(user)
    val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
    val filtered = ordered.filter(isEmployee($"login"))
    filtered.show()
  }
}

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/09/01 10:21:48 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:850)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:849)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:128)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
    at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:151)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:136)
    at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3383)
    at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2544)
    at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3364)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2758)
    at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:745)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:704)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:713)
    at App$.main(App.scala:33)
    at App.main(App.scala)
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Serialization stack:
    - object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)
    - element of array (index: 2)
    - array (class [Ljava.lang.Object;, size 3)
    - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
    - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.catalyst.expressions.ScalaUDF, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/catalyst/expressions/ScalaUDF.$anonfun$f$2:(Lscala/Function1;Lorg/apache/spark/sql/catalyst/expressions/Expression;Lscala/runtime/LazyRef;Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, instantiatedMethodType=(Lorg/apache/spark/sql/catalyst/InternalRow;)Ljava/lang/Object;, numCaptured=3])
    - writeReplace data (class: java.lang.invoke.SerializedLambda)
    - object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005, org.apache.spark.sql.catalyst.expressions.ScalaUDF$$Lambda$2364/2031154005@1fd37440)
    - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
    - object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF:SetContainsUdf(actor#6.login))
    - writeObject data (class: scala.collection.immutable.List$SerializationProxy)
    - object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy@3b65084e)
    - writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
    - object (class scala.collection.immutable.$colon$colon, List(isnotnull(type#13), (type#13 = PushEvent), UDF:SetContainsUdf(actor#6.login)))
    - field (class: org.apache.spark.sql.execution.FileSourceScanExec, name: dataFilters, type: interface scala.collection.Seq)
    - object (class org.apache.spark.sql.execution.FileSourceScanExec, FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
    - field (class: org.apache.spark.sql.execution.FilterExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
    - object (class org.apache.spark.sql.execution.FilterExec, Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
+- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
    - field (class: org.apache.spark.sql.execution.ProjectExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
    - object (class org.apache.spark.sql.execution.ProjectExec, Project [actor#6]
+- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
   +- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)
    - field (class: org.apache.spark.sql.execution.aggregate.HashAggregateExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)
    - object (class org.apache.spark.sql.execution.aggregate.HashAggregateExec, HashAggregate(keys=[actor#6.login AS actor#6.login#53], functions=[partial_count(1)], output=[actor#6.login#53, count#43L])
+- Project [actor#6]
   +- Filter ((isnotnull(type#13) && (type#13 = PushEvent)) && UDF:SetContainsUdf(actor#6.login))
      +- FileScan json [actor#6,type#13] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/C:/Users/abhaydub/Scala-Spark-workspace/LearningSpark/src/resources/2015-..., PartitionFilters: [], PushedFilters: [IsNotNull(type), EqualTo(type,PushEvent)], ReadSchema: struct<actor:struct<avatar_url:string,gravatar_id:string,id:bigint,login:string,url:string>,type:...
)

(output of ordered.show(5)):

+------------------+-----+
|             login|count|
+------------------+-----+
|      greatfirebot|  192|
|diversify-exp-user|  146|
|     KenanSulayman|   72|
|        manuelrp07|   45|
|    mirror-updates|   42|
+------------------+-----+
only showing top 5 rows
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 14)
    - element of array (index: 1)
    - array (class [Ljava.lang.Object;, size 3)
    - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
    - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class org.apache.spark.sql.execution.WholeStageCodegenExec, functionalInterfaceMethod=scala/Function2.apply:(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic org/apache/spark/sql/execution/WholeStageCodegenExec.$anonfun$doExecute$4$adapted:(Lorg/apache/spark/sql/catalyst/expressions/codegen/CodeAndComment;[Ljava/lang/Object;Lorg/apache/spark/sql/execution/metric/SQLMetric;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, instantiatedMethodType=(Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;, numCaptured=3])
    - writeReplace data (class: java.lang.invoke.SerializedLambda)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243, org.apache.spark.sql.execution.WholeStageCodegenExec$$Lambda$1297/815648243@27438750)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
    ... 48 more

3 Answers:

Answer 0 (score: 1)

The serialization problem is not because the object does not extend Serializable. The object itself is not what gets serialized and sent to the executors for execution; it is the transformation code (the closure) that gets serialized.

One of the functions used in the code is not serializable. Looking at the code and the trace, isEmployee appears to be the problem. A couple of observations:
1. isEmployee is not a UDF. In Spark, a UDF is created as an org.apache.spark.sql.expressions.UserDefinedFunction, which is Serializable, and after defining the function it needs to be registered using org.apache.spark.sql.UDFRegistration#register.

I can think of two solutions:
1. Create and register the UDF properly, so that serialization happens correctly.
2. Avoid the UDF entirely and filter using the broadcast variable directly, as sketched below.
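The code from the original answer is not included here; a minimal sketch of the two options, reusing the spark, ordered and bcEmployees values from the question (and assuming spark.implicits._ is in scope), might look like this:

// Option 1: give the function an explicit type and register it as a UDF.
val isEmp: String => Boolean = user => bcEmployees.value.contains(user)
val isEmployee = spark.udf.register("SetContainsUdf", isEmp)
val filteredWithUdf = ordered.filter(isEmployee($"login"))

// Option 2: skip the UDF and filter directly against the broadcast set's value
// (isin inlines the values into the plan rather than using the broadcast on the executors).
val filteredWithIsin = ordered.filter($"login".isin(bcEmployees.value.toSeq: _*))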

Answer 1 (score: 1)

I was on Spark 2.4.4 with Scala "2.12.1" and hit the same problem (object not serializable (class: scala.runtime.LazyRef, value: LazyRef thunk)), which drove me crazy. I changed the Scala version to "2.12.10" and the problem is now fixed!
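The answer does not show a build file; as a rough sketch, assuming an sbt project, the fix amounts to bumping the Scala patch version, for example:

// build.sbt (illustrative sketch; the Spark and Scala versions are the ones mentioned in this answer)
scalaVersion := "2.12.10"  // 2.12.1 triggers the scala.runtime.LazyRef serialization failure

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"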

Answer 2 (score: 0)

Life is full of mysteries. Serialization is one of them, and some aspects of the spark-shell and Databricks notebooks make it easier to get away with.

https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54 is worth consulting to confirm that extending Serializable at the top level, as you already do, is not the real clue here. As I understand it, the driver ships the relevant objects to the executors.

  • If I run your code as-is in a Databricks notebook, without extending Serializable anywhere, it works fine! In the past I have always been able to catch serialization issues in Databricks notebooks, until now that is. Interestingly, in a pseudo-cluster like that I would expect to have all possible serialization issues ironed out before release; evidently that is not always the case. Interesting, but a notebook is not spark-submit.

  • If I run the logic in the spark-shell, either with two consecutive ":paste" blocks as shown below or line by line, it also works fine, with 1) a few things omitted and 2) the function behind your UDF wrapped in an object that extends Serializable (the UDF takes a column, so we stick with the UDF approach).

Paste 1:

scala> :paste

// Entering paste mode (ctrl-D to finish)

import org.apache.spark.sql.SparkSession
import scala.io.Source

val spark = SparkSession.builder()
  .appName("LearningSpark")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// Simulated input.
val ghLog = Seq(("john2X0", "push"), ("james09", "abc"), ("peter01", "push"), ("mary99", "push"), ("peter01", "push")).toDF("login", "type")
val pushes = ghLog.filter("type = 'push'")
val grouped = pushes.groupBy("login").count
val ordered = grouped.orderBy(grouped("count").desc)
ordered.show(5)

// Employee list is read on the driver and broadcast before the UDF object is defined.
val emp = "/home/mapr/emp.txt"
val employees = Set() ++ (
  for {
    line <- Source.fromFile(emp).getLines
  } yield line.trim)
val bcEmployees = sc.broadcast(employees)

Paste 2:

// The object wrapping the function extends Serializable; it can only be defined
// once bcEmployees (from paste 1) already exists in the session, since it captures it.
object X extends Serializable {
  val isEmp = (user: String) => bcEmployees.value.contains(user)
}

// Register the function as a UDF and filter with it.
import org.apache.spark.sql.functions.udf
val isEmployee = udf(X.isEmp)

val filtered = ordered.filter(isEmployee($"login"))
filtered.show()

So, the other answer suggests not doing this with a UDF, and in some cases that performs better, but I stick with the UDF here: it takes a column as input and can be reused. This approach works with spark-submit as well, though that should be obvious; mentioned for posterity.
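For instance, because the UDF takes a column, the same isEmployee can be applied to another DataFrame as well (the DataFrame and column name below are hypothetical):

// Hypothetical reuse of the same UDF on a different DataFrame and column.
val otherFiltered = someOtherDf.filter(isEmployee($"username"))
otherFiltered.show()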