My question is similar to this one, except that I am using Scala.
For the following code:
roleList = res.select($"results", explode($"results").as("results_flat1"))
.select("results_flat1.*")
.select(explode($"rows").as("rows"))
.select($"rows".getItem(0).as("x"))
.withColumn("y", trim(col("x")))
.select($"y".as("ROLE_NAME"))
.map(row => Role(row.getAs[String](0)))
if (roleList.count() != 0) {
println(s"Number of Roles = ${roleList.count()}")
roleList.foreach{role =>
var status = ""
do {
val response = getDf()
response.show()
status = response.select("status").head().getString(0)
var error = ""
error = response.select($"results", explode($"results").as("results_flat1"))
.select("results_flat1.*")
.select($"error")
.first().get(0).asInstanceOf[String]
}
while (status != "completed")
}
}
I get the following exception:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply$mcV$sp(Dataset.scala:2716)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2716)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2716)
at org.apache.spark.sql.Dataset$$anonfun$withNewRDDExecutionId$1.apply(Dataset.scala:3349)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withNewRDDExecutionId(Dataset.scala:3345)
at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2715)
at com.cloud.operations.RoleOperations.createRoles(RoleOperations.scala:30)
at com.cloud.Main$.main(Main.scala:24)
at com.cloud.Main.main(Main.scala)
Caused by: java.io.NotSerializableException: com.cloud.operations.RoleOperations
Serialization stack:
- object not serializable (class: com.cloud.operations.RoleOperations, value: com.cloud.operations.RoleOperations@67a3394c)
- field (class: com.cloud.operations.RoleOperations$$anonfun$createRoles$1, name: $outer, type: class com.cloud.operations.RoleOperations)
- object (class com.cloud.operations.RoleOperations$$anonfun$createRoles$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
... 21 more
RoleOperations.scala:30 refers to the line where roleList.foreach begins.
I am not sure why this happens. Going by the answers to the linked question, even though getDf() does use spark.read.json (from a SparkSession), I am not using the Spark context anywhere in my code. And even if that were the problem, shouldn't the exception occur at that line rather than at the line above it? This is what really confuses me. Any help would be appreciated.
Answer 0 (score: 1):
First, you cannot use the Spark session inside a function that is executed on the executors; SparkSession is available only in driver code. In your case, everything inside roleList.foreach is executed on the executors, not on the driver.
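One common fix, assuming the list of roles is small enough to fit in driver memory, is to collect it to the driver before looping. The foreach then runs as ordinary Scala in driver code, where the SparkSession (and therefore getDf()) can be used safely. A minimal sketch of that change:

// Bring the roles back to the driver as a plain Array[Role];
// the foreach below is ordinary driver-side Scala, not a Spark job.
roleList.collect().foreach { role =>
  var status = ""
  do {
    val response = getDf() // safe here: this runs on the driver
    response.show()
    status = response.select("status").head().getString(0)
  } while (status != "completed")
}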
The same error can also occur when a variable defined in the enclosing class is used inside executor code. In that case the whole class has to be sent to the executors, and if it is not serializable, you get this error.
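That is exactly what the Caused by: java.io.NotSerializableException: com.cloud.operations.RoleOperations frames in the stack trace show: the closure passed to foreach references a member of RoleOperations (such as getDf()), so it captures this, and Spark tries to serialize the whole instance. For fields that are themselves serializable, a standard workaround is to copy them into a local val so that only the value, not the enclosing class, is captured. A sketch with a hypothetical prefix field:

class RoleOperations {
  val prefix = "ROLE_" // hypothetical field, for illustration only

  def tagRoles(ds: org.apache.spark.sql.Dataset[String]): Unit = {
    // Referencing `prefix` directly inside the closure would capture `this`
    // and force Spark to serialize all of RoleOperations:
    // ds.foreach(name => println(prefix + name)) // Task not serializable

    // Copying the field into a local val first means the closure only
    // captures a String, which serializes fine:
    val p = prefix
    ds.foreach(name => println(p + name))
  }
}

Alternatively, making the class extend Serializable (and ensuring all of its fields are serializable) also removes this particular error, but in this question the root cause remains the SparkSession usage, which cannot be shipped to executors at all.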