When I use this code, it works fine:
val result = rdd.filter(row =>
  row.get[DateTime]("eventtime") > Offset._1 &&
  row.get[DateTime]("eventtime") <= Offset._2)
However, when I generalize the code, I get a "Task not serializable" exception.
Code:
def resultFilter(offsetValue: (Imports.DateTime, Imports.DateTime)) = (x: CassandraRow) => {
  val date = x.get[DateTime]("mytime")
  date > offsetValue._1 && date <= offsetValue._2
}
And here is its usage (which throws the error):
rdd.filter(resultFilter(offsetValue))
Output:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.SparkContext.clean(SparkContext.scala:1476) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.rdd.RDD.filter(RDD.scala:300) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at com.aruba.sparkjobs.apprf.LeaderBoardJob.runJob(LeaderBoardJob.scala:203) ~[ee507b50-011f-42de-8bd5-536ca113d640-2015-09-25T11:11:23.637+05:30.jar:1.0.0-b.3]
How can I make the above function serializable?
Answer 0 (score: 0)
I'm going to guess that each of the Imports.DateTime objects in the tuple parameter offsetValue holds a reference back to the (non-serializable) Imports object. If that's the case, then something like this might work:
def resultFilter(offsetValue: (Imports.DateTime, Imports.DateTime)) = {
  // Copy the tuple's elements into local vals so the closure below
  // captures only these two values, not the whole tuple (or anything
  // it references).
  val offsetValue1 = offsetValue._1
  val offsetValue2 = offsetValue._2
  (x: CassandraRow) => {
    val date = x.get[DateTime]("mytime")
    date > offsetValue1 && date <= offsetValue2
  }
}
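With this change the original call site should work unchanged:

rdd.filter(resultFilter(offsetValue))

More generally, this is the standard workaround whenever a Spark closure would drag a non-serializable enclosing object onto the executors: copy the fields you actually need into local vals before building the closure, so only those values are serialized. Below is a minimal, self-contained sketch of the pattern; the Config class and its threshold field are hypothetical, purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// A hypothetical configuration holder that is NOT serializable.
class Config(val threshold: Int)

object CaptureDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("capture-demo").setMaster("local[*]"))
    val config = new Config(10)

    // BAD: the closure references `config`, so Spark tries to serialize
    // the whole Config instance and fails with "Task not serializable":
    //   sc.parallelize(1 to 100).filter(_ > config.threshold)

    // GOOD: copy the needed field into a local val first; the closure
    // then captures only a serializable Int.
    val threshold = config.threshold
    val result = sc.parallelize(1 to 100).filter(_ > threshold).collect()
    println(result.mkString(", "))
    sc.stop()
  }
}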