Dataset foreach loop in Scala throws SparkException: Task not serializable

Posted: 2019-12-11 07:48:53

Tags: scala apache-spark apache-spark-sql apache-spark-dataset

My question is quite similar to this one, except that I am using Scala.

For the following code:

        roleList = res.select($"results", explode($"results").as("results_flat1"))
                        .select("results_flat1.*")
                        .select(explode($"rows").as("rows"))
                        .select($"rows".getItem(0).as("x"))
                        .withColumn("y", trim(col("x")))
                        .select($"y".as("ROLE_NAME"))
                        .map(row => Role(row.getAs[String](0)))

        if (roleList.count() != 0) {
            println(s"Number of Roles = ${roleList.count()}")

            roleList.foreach{role =>
                var status = ""

                do {
                    val response = getDf()
                    response.show()

                    status = response.select("status").head().getString(0)
                    var error = ""

                    error = response.select($"results", explode($"results").as("results_flat1"))
                                .select("results_flat1.*")
                                .select($"error")
                                .first().get(0).asInstanceOf[String]
                }
                while (status != "completed")
            }
        }

I get the following exception:

Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:926)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
    at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply$mcV$sp(Dataset.scala:2716)
    at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2716)
    at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2716)
    at org.apache.spark.sql.Dataset$$anonfun$withNewRDDExecutionId$1.apply(Dataset.scala:3349)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.Dataset.withNewRDDExecutionId(Dataset.scala:3345)
    at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2715)
    at com.cloud.operations.RoleOperations.createRoles(RoleOperations.scala:30)
    at com.cloud.Main$.main(Main.scala:24)
    at com.cloud.Main.main(Main.scala)
Caused by: java.io.NotSerializableException: com.cloud.operations.RoleOperations
Serialization stack:
    - object not serializable (class: com.cloud.operations.RoleOperations, value: com.cloud.operations.RoleOperations@67a3394c)
    - field (class: com.cloud.operations.RoleOperations$$anonfun$createRoles$1, name: $outer, type: class com.cloud.operations.RoleOperations)
    - object (class com.cloud.operations.RoleOperations$$anonfun$createRoles$1, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
    ... 21 more

RoleOperations.scala:30 is the line where roleList.foreach starts.

I am not sure why this happens. Going by the answers to the linked question, I am not using the Spark context anywhere in my code, although getDf() does use spark.read.json (from SparkSession). Even in that case, shouldn't the exception occur on the line above rather than on this one? That really confuses me. Any help would be appreciated.

1 Answer:

Answer 0 (score: 1)

First, you cannot use the Spark session inside a function that runs on the executors. SparkSession is only available in driver code.

In your case, everything inside roleList.foreach runs on the executors, not on the driver, as sketched below.
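
One way around this (a minimal sketch, assuming roleList is small enough to collect to the driver, and reusing getDf() and Role from the question) is to pull the roles back to the driver and run the polling loop there, so getDf() is only ever called in driver code:

    import org.apache.spark.sql.DataFrame

    // Sketch: bring the roles to the driver so the loop runs in driver code only.
    // collect() assumes the role list fits comfortably in driver memory.
    val roles: Array[Role] = roleList.collect()

    roles.foreach { role =>
      var status = ""
      do {
        val response: DataFrame = getDf()   // SparkSession is used here, on the driver
        status = response.select("status").head().getString(0)
      } while (status != "completed")
    }

If the role list could be large, collect() is not appropriate and the per-role work should instead be expressed as Spark transformations rather than driver-side loops.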

Also, the same error can occur when you use a variable defined in the enclosing class inside executor code. In that case the whole class has to be shipped to the executors, and if it is not serializable you get this error.
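
A hypothetical sketch of that second case (the class name and field are assumptions, chosen to mirror the stack trace): referencing a field inside the closure drags the whole enclosing instance into it, while copying the field into a local val keeps the closure serializable:

    import org.apache.spark.sql.{Dataset, SparkSession}

    // Illustrative class only; it is not the code from the question.
    class RoleOperations(spark: SparkSession) {   // not Serializable
      val prefix = "ROLE_"

      def bad(ds: Dataset[String]): Unit =
        ds.foreach(r => println(prefix + r))      // `prefix` is this.prefix, so `this` is captured
                                                  // => Task not serializable

      def ok(ds: Dataset[String]): Unit = {
        val p = prefix                            // copy the field into a local val
        ds.foreach(r => println(p + r))           // closure now captures only a String
      }
    }

This matches the serialization stack in the exception: the closure's $outer field is the RoleOperations instance, which Spark cannot serialize.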