My code:
val result = rdd.filter(x => x.get[DateTime]("mytime") > offsetvalue._1 &&
  x.get[DateTime]("mytime") <= offsetvalue._2)
I want to condense this into:
val result = rdd.filter(x => myFunction(x))

where myFunction is defined along the lines of:

def myFunction(x: B): Boolean =  // B being the RDD's element type
  x.get[DateTime]("mytime") > offsetvalue._1 &&
  x.get[DateTime]("mytime") <= offsetvalue._2
When myFunction is invoked, it throws an exception:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.SparkContext.clean(SparkContext.scala:1476) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.rdd.RDD.filter(RDD.scala:300) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at com.aruba.sparkjobs.apprf.LeaderBoardJob.runJob(LeaderBoardJob.scala:203) ~[ee507b50-011f-42de-8bd5-536ca113d640-2015-09-25T11:11:23.637+05:30.jar:1.0.0-b.3]
How can I serialize the function above?
Answer 0 (score: 2)
Something like:

def resultFilter(offsetValue: (A, A)) = (x: B) => {
  val date = x.get[DateTime]("mytime")
  date > offsetValue._1 && date <= offsetValue._2
}
rdd.filter(resultFilter(offsetValue))
You will have to fill in A and B yourself, since there is not enough information in your question to infer them.
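For a concrete illustration, here is one possible instantiation. It assumes, hypothetically, that the RDD holds spark-cassandra-connector CassandraRow elements and that the bounds are joda-time DateTimes (both inferred from the get[DateTime] call, neither confirmed by the question); isAfter is used in place of > and <=, since joda's DateTime does not provide those operators on its own:

import org.joda.time.DateTime
import com.datastax.spark.connector.CassandraRow

// Assumed: A = DateTime (the bound type), B = CassandraRow (the element type)
def resultFilter(offsetValue: (DateTime, DateTime)) = (x: CassandraRow) => {
  val date = x.get[DateTime]("mytime")
  // strictly after the lower bound, and not after (i.e. <=) the upper bound
  date.isAfter(offsetValue._1) && !date.isAfter(offsetValue._2)
}

rdd.filter(resultFilter(offsetValue))

Because resultFilter returns a plain function value that closes over nothing but offsetValue, Spark only has to serialize the tuple, not the enclosing class, which is what makes the Task not serializable error go away.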
Answer 1 (score: 1)
This is not a direct answer to your question, but you can make the expression more readable this way:
val (min, max) = offsetValue
val result = rdd.map(_.get[DateTime]("mytime"))
.filter(t => t > min && t <= max)
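Note that this version yields an RDD[DateTime] rather than the original rows. If you need the rows themselves, the same destructuring of offsetValue works with a plain filter (a minimal sketch, keeping the asker's original comparison operators):

val (min, max) = offsetValue
val result = rdd.filter { x =>
  val t = x.get[DateTime]("mytime")
  t > min && t <= max
}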
Here is a direct answer to your question:
def myFun(x: YourType): Boolean = {
val (min, max) = (dateTime1, dateTime2) // the values from offsetValue, assuming they are constant
val t = x.get[DateTime]("mytime")
t > min && t <= max
}
Then call it as:
val res = rdd.filter(myFun)
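One caveat: if myFun is defined as a method on a non-serializable enclosing class (such as the LeaderBoardJob in your stack trace), rdd.filter(myFun) will still capture this and fail the same way. A common workaround, sketched here with the answer's placeholder names (YourType, dateTime1, dateTime2) left for you to fill in, is to move the function into a standalone object:

object TimeFilters {  // hypothetical name
  def myFun(x: YourType): Boolean = {
    val (min, max) = (dateTime1, dateTime2)
    val t = x.get[DateTime]("mytime")
    t > min && t <= max
  }
}

val res = rdd.filter(TimeFilters.myFun)

Since a top-level object is instantiated on each executor's JVM rather than shipped inside the closure, no outer instance has to be serialized with the task.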