How to fix an error when applying a rounding function to a Dataset column (SparkException: Task not serializable)

Date: 2019-04-12 13:35:47

Tags: scala apache-spark databricks

I have started using Scala with Spark in a Databricks notebook, but I am getting a strange error:

 SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.Column
 Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: t020101)
- writeObject data (class: scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, scala.collection.immutable.List$SerializationProxy@1ccc6944)
- writeReplace data (class: scala.collection.immutable.List$SerializationProxy)
 ...

The code works fine when I apply the rounding directly to the values:

 def timeUsageGroupedRound(summed: Dataset[TimeUsageRow]): Dataset[TimeUsageRow] = {
   summed.map {
     case TimeUsageRow(working, sex, age, primaryNeeds, work, other) =>
       TimeUsageRow(working, sex, age, (primaryNeeds * 10).round / 10d, (work * 10).round / 10d, (other * 10).round / 10d)
   }
 }

 val time_Usage_Round_DS = timeUsageGroupedRound(time_Usage_Grouped_DS)
 display(time_Usage_Round_DS)

However, when I try to use a helper function instead, I get the error mentioned above:

 def timeUsageGroupedRound(summed: Dataset[TimeUsageRow]): Dataset[TimeUsageRow] = {
   def round1(d: Double): Double = (d * 10).round / 10d

   summed.map {
     case TimeUsageRow(working, sex, age, primaryNeeds, work, other) =>
       TimeUsageRow(working, sex, age, round1(primaryNeeds), round1(work), round1(other))
   }
 }
 val time_Usage_Round_DS = timeUsageGroupedRound(time_Usage_Grouped_DS)
 display(time_Usage_Round_DS)

Can anyone explain why this happens? Thanks a lot!

1 Answer:

Answer 0 (score: 0)

Short answer 1:

Move round1 out of your class and into an object (perhaps a companion object: https://docs.scala-lang.org/tour/singleton-objects.html).
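
For illustration, here is a minimal sketch of that option. The object name RoundingHelpers is invented, and TimeUsageRow (the asker's case class) and an implicit Encoder for it are assumed to be in scope, as in the question:

 import org.apache.spark.sql.Dataset

 // A standalone object: calling its method from inside .map does not force
 // Spark to serialize whatever class the call site happens to live in.
 object RoundingHelpers {
   def round1(d: Double): Double = (d * 10).round / 10d
 }

 def timeUsageGroupedRound(summed: Dataset[TimeUsageRow]): Dataset[TimeUsageRow] =
   summed.map {
     case TimeUsageRow(working, sex, age, primaryNeeds, work, other) =>
       TimeUsageRow(working, sex, age,
         RoundingHelpers.round1(primaryNeeds),
         RoundingHelpers.round1(work),
         RoundingHelpers.round1(other))
   }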

Short answer 2:

Alternatively, move anything that is not Serializable out of your class (see the long answer) - although depending on the size of the class, this can be painful.
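
As a hedged sketch of what that can look like, suppose the Column from the stack trace (t020101) was held as a field of the same class that defines round1. The class name TimeUsageAnalysis and this layout are invented for illustration, and TimeUsageRow plus its Encoder are again assumed to be in scope:

 import org.apache.spark.sql.Dataset

 class TimeUsageAnalysis extends Serializable {

   // val t020101: Column = ...   // Column is not Serializable: move it into a local
   //                             // val wherever it is actually used, or keep it
   //                             // outside this class entirely.

   private def round1(d: Double): Double = (d * 10).round / 10d

   def timeUsageGroupedRound(summed: Dataset[TimeUsageRow]): Dataset[TimeUsageRow] =
     summed.map {
       case TimeUsageRow(working, sex, age, primaryNeeds, work, other) =>
         // `this` is still captured here (round1 is an instance method), but the
         // instance can now be serialized because the Column field is gone.
         TimeUsageRow(working, sex, age, round1(primaryNeeds), round1(work), round1(other))
     }
 }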

Long answer:

This is an interesting one that has tripped me up several times in the past. First, when you run .map on a Dataset/DataFrame, everything inside the map - in your case:

case TimeUsageRow(working, sex, age, primaryNeeds, work, other) => 
   TimeUsageRow(working, sex, age, round1(primaryNeeds), round1(work), round1(other))

- gets packaged up and sent from the driver to the executors. Because of how Spark communicates between the driver and the executors, everything you send over must be Serializable. The error occurs because pulling in round1 also drags along the rest of the enclosing class, and if anything in that class is not Serializable, you get this error.
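
To make that concrete, here is a hypothetical reproduction of the failure mode (the class and field names are invented; TimeUsageRow and its Encoder are assumed to be in scope as in the question):

 import org.apache.spark.sql.{Column, Dataset}
 import org.apache.spark.sql.functions.col

 class Analysis extends Serializable {
   val someColumn: Column = col("t020101")   // a Column field - not Serializable

   def round1(d: Double): Double = (d * 10).round / 10d

   // round1 is an instance method, so the lambda passed to .map captures `this`;
   // serializing `this` then fails on the someColumn field, producing the same
   // NotSerializableException as in the stack trace above.
   def run(summed: Dataset[TimeUsageRow]): Dataset[TimeUsageRow] =
     summed.map {
       case TimeUsageRow(working, sex, age, primaryNeeds, work, other) =>
         TimeUsageRow(working, sex, age, round1(primaryNeeds), round1(work), round1(other))
     }
 }

Note that defining round1 locally inside the method, as in the question, typically does not help: the local def is still compiled as a method of the enclosing class, so the closure ends up capturing `this` anyway.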