How do I fix a Task not serializable exception?

Asked: 2015-09-25 06:02:24

Tags: scala apache-spark

This code works fine:

val result = rdd.filter(row =>
  row.get[DateTime]("eventtime") > Offset._1 && 
  row.get[DateTime]("eventtime") <= Offset._2)

However, when I generalize the code, I get a "Task not serializable" exception.

Code:

def resultFilter(offsetValue: (Imports.DateTime, Imports.DateTime)) = (x: CassandraRow) => {
  val date = x.get[DateTime]("mytime")
  date > offsetValue._1 && date <= offsetValue._2
}

And its usage (which throws the error):

rdd.filter(resultFilter(offsetValue))

Output:

org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.SparkContext.clean(SparkContext.scala:1476) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.rdd.RDD.filter(RDD.scala:300) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at com.aruba.sparkjobs.apprf.LeaderBoardJob.runJob(LeaderBoardJob.scala:203) ~[ee507b50-011f-42de-8bd5-536ca113d640-2015-09-25T11:11:23.637+05:30.jar:1.0.0-b.3]

How can I make the above function serializable?

1 Answer:

Answer 0 (score: 0)

I'm going to guess that each Imports.DateTime object in the tuple parameter offsetValue holds a reference to the (non-serializable) Imports object. If that's the case, then something like this might work:

def resultFilter(offsetValue: (Imports.DateTime, Imports.DateTime)) = {
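  // Copy the tuple elements into local vals so the returned closure
  // captures only these two values, not the offsetValue parameter itself.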
  val offsetValue1 = offsetValue._1
  val offsetValue2 = offsetValue._2
  (x: CassandraRow) => {
    val date = x.get[DateTime]("mytime")
    date > offsetValue1 && date <= offsetValue2
  }
}
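
A related workaround, just a sketch rather than part of the original answer: since the closure only needs to compare timestamps, you can convert the bounds to epoch milliseconds up front, so the returned function closes over plain Longs and no DateTime (or Imports) object has to be serialized at all. This assumes Imports.DateTime is the usual Joda-Time DateTime alias; resultFilterMillis is a hypothetical name.

import org.joda.time.DateTime
import com.datastax.spark.connector.CassandraRow

def resultFilterMillis(offsetValue: (DateTime, DateTime)) = {
  // Capture the bounds as plain Longs (epoch millis); the closure then
  // serializes two primitives instead of DateTime objects.
  val lower = offsetValue._1.getMillis
  val upper = offsetValue._2.getMillis
  (x: CassandraRow) => {
    val millis = x.get[DateTime]("mytime").getMillis
    millis > lower && millis <= upper
  }
}

Usage is unchanged: rdd.filter(resultFilterMillis(offsetValue)).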