org.apache.spark.SparkException: Task not serializable. Scala Spark

Date: 2020-04-30 17:52:15

Tags: scala apache-spark

Moving an existing application from Spark 1.6 to Spark 2.2 ultimately brings up the error "org.apache.spark.SparkException: Task not serializable". I have oversimplified my code to demonstrate the same error. The code queries a Parquet file and returns the following data type: 'org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]'. I apply a function to extract a string and an integer, and return a string. An inherent issue relates to Spark 2.2 returning a Dataset rather than a DataFrame (for the earlier error, see my previous post: How do I write a Dataset encoder to support mapping a function to a org.apache.spark.sql.Dataset[String] in Scala Spark).

scala> var d1 = hive.executeQuery(st)
d1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cvdt35_message_id_d: string, cvdt35_input_timestamp_s: decimal(16,5) ... 2 more fields]

scala> val parseCVDP_parquet = (s:org.apache.spark.sql.Row) => s.getString(2).split("0x")(1)+","+s.getDecimal(1);
parseCVDP_parquet: org.apache.spark.sql.Row => String = <function1>

scala> var d2 =  d1.map(parseCVDP_parquet)
d2: org.apache.spark.sql.Dataset[String] = [value: string]

scala> def dd(s:String, start: Int) = { s + "some string" }
dd: (s: String, start: Int)String

scala> var d3 = d2.map{s=> dd(s,5) }
d3: org.apache.spark.sql.Dataset[String] = [value: string]

scala> d3.take(1)
org.apache.spark.SparkException: Task not serializable
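
If I understand the spark-shell behaviour correctly, the likely cause is that dd is a def on the REPL's wrapper object, so the closure passed to map has to serialize that wrapper along with whatever driver-side state it references (such as the hive session above). As far as I can tell, the same failure can be reproduced outside the shell with any non-serializable enclosing class; the sketch below uses hypothetical names and a stand-in Dataset rather than my actual query:

import org.apache.spark.sql.SparkSession

// Hypothetical minimal reproduction: the lambda calls an instance method, so it
// captures `this`, and the enclosing Job class is not serializable.
class Job(spark: SparkSession) {
  import spark.implicits._

  def dd(s: String, start: Int): String = s + "some string"

  def run(): Array[String] = {
    val d2 = Seq("761f006000705904,1521833533.96682").toDS()
    d2.map(s => dd(s, 5)).take(1) // throws SparkException: Task not serializable
  }
}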

My current workaround is to embed the code inline (see below), but this is not practical because my production code involves a large number of parameters and functions. I have also tried converting to a DataFrame (as in Spark 1.6) and various function definitions, but none of these has proven to be a viable solution so far.

scala> var d1 = hive.executeQuery(st)
d1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [cvdt35_message_id_d: string, cvdt35_input_timestamp_s: decimal(16,5) ... 2 more fields]

scala> val parseCVDP_parquet = (s:org.apache.spark.sql.Row) => s.getString(2).split("0x")(1)+","+s.getDecimal(1);
parseCVDP_parquet: org.apache.spark.sql.Row => String = <function1>

scala> var d2 =  d1.map(parseCVDP_parquet)
d2: org.apache.spark.sql.Dataset[String] = [value: string]

scala> var d3 = d2.map{s=> { s + "some string" } }
d3: org.apache.spark.sql.Dataset[String] = [value: string]

scala> d3.take(1)
20/04/30 15:16:17 WARN TaskSetManager: Stage 0 contains a task of very large size (132 KB). The maximum recommended task size is 100 KB.
res1: Array[String] = Array(761f006000705904,1521833533.96682some string)

1 Answer:

Answer (score: 0):

org.apache.spark.SparkException: Task not serializable

To resolve this issue, put all of your functions and variables inside an object, and use those functions and variables from wherever they are needed.

This way you can resolve most serialization problems.

Example

package common

object AppFunctions {
  // A method on a top-level object needs no enclosing instance, so calling it
  // from a closure does not pull any driver-side state into the task.
  def append(s: String, start: Int): String = s"${s}some string"
}

object ExecuteQuery {
  import common.AppFunctions._

  [...]

  val d3 = d2.map(s => append(s, 5)) // Pass the required values to the method.

  [...]
}
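
This works because a top-level object needs no enclosing instance: the closure s => append(s, 5) captures only the string being mapped, and AppFunctions is loaded again inside each executor JVM instead of being serialized from the driver. As a rough illustration, a self-contained driver along these lines should run without the error (the object name and the stand-in Dataset are assumptions, not the asker's actual hive query):

package common

import org.apache.spark.sql.SparkSession

object ExecuteQueryExample {
  import common.AppFunctions._

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExecuteQueryExample").getOrCreate()
    import spark.implicits._

    // Stand-in for the Dataset[String] produced by d1.map(parseCVDP_parquet) in the question.
    val d2 = Seq("761f006000705904,1521833533.96682").toDS()

    val d3 = d2.map(s => append(s, 5)) // only the lambda itself is serialized
    d3.take(1).foreach(println)

    spark.stop()
  }
}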