How can I condense the following in Scala?

Asked: 2015-09-23 12:33:38

Tags: scala

My code:

val result = rdd.filter(x => x.get[DateTime]("mytime") > offsetvalue._1 &&
             x.get[DateTime]("mytime") <= offsetvalue._2)

I want to condense the code into something like:

val result = rdd.filter(x => myFunction())
where myFunction() { x => x.get[DateTime]("mytime") > offsetvalue._1 &&
             x.get[DateTime]("mytime") <= offsetvalue._2 }

When myFunction is called, it throws an exception:

org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.SparkContext.clean(SparkContext.scala:1476) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at org.apache.spark.rdd.RDD.filter(RDD.scala:300) ~[spark-core_2.10-1.2.2.2.jar:1.2.2.2]
at com.aruba.sparkjobs.apprf.LeaderBoardJob.runJob(LeaderBoardJob.scala:203) ~[ee507b50-011f-42de-8bd5-536ca113d640-2015-09-25T11:11:23.637+05:30.jar:1.0.0-b.3]

How do I serialize the above function?

2 Answers:

Answer 0 (score: 2)

Something like this:
def resultFilter(offsetValue: (A, A)) = (x: B) => {
  val date = x.get[DateTime]("mytime")
  date > offsetValue._1 && date <= offsetValue._2
}

rdd.filter(resultFilter(offsetValue))

You will have to fill in A and B yourself, since there isn't enough information in your question to infer them.
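
For illustration only, here is the same helper with the type holes filled in, assuming the elements expose the get[DateTime]("mytime") accessor shown in your question and that offsetValue is a pair of Joda-Time DateTime values. The MyRow trait is a made-up stand-in for whatever your RDD actually contains, and the comparison is written with Joda's isAfter so it compiles without an implicit ordering:

import org.joda.time.DateTime

// Hypothetical stand-in for the real element type; the question only shows
// that it supports get[DateTime]("mytime").
trait MyRow {
  def get[T](column: String): T
}

// Returns a plain function value, so Spark only has to serialize the closure
// over offsetValue, not an enclosing (possibly non-serializable) class.
def resultFilter(offsetValue: (DateTime, DateTime)): MyRow => Boolean =
  (x: MyRow) => {
    val date = x.get[DateTime]("mytime")
    // equivalent to: date > offsetValue._1 && date <= offsetValue._2
    date.isAfter(offsetValue._1) && !date.isAfter(offsetValue._2)
  }

rdd.filter(resultFilter(offsetvalue))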

Answer 1 (score: 1)

This isn't a direct answer to your question, but you can make the expression more readable this way:

val (min, max) = offsetValue
val result = rdd.map(_.get[DateTime]("mytime"))
  .filter(t => t > min && t <= max)

And here is a direct answer to your question:

def myFun(x: YourType): Boolean = {
  val (min, max) = (dateTime1, dateTime2) // the values from offsetValue, assuming they are constant
  val t = x.get[DateTime]("mytime")
  t > min && t <= max
}

and then call it as:

val res = rdd.filter(myFun)
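
On the Task not serializable error itself: it usually means the predicate (or a method it calls, like your original myFunction) is defined on a class that Spark then tries to ship to the executors, such as the job class visible in your stack trace. One common fix, sketched here under the same assumptions as above (made-up object name, Joda-Time comparisons, locally extracted min/max), is to keep the predicate in a standalone object and close only over plain values:

import org.joda.time.DateTime

// A top-level object holds the predicate, so no enclosing job class
// gets pulled into the closure that Spark has to serialize.
object TimeFilters extends Serializable {
  def inWindow(min: DateTime, max: DateTime)(t: DateTime): Boolean =
    t.isAfter(min) && !t.isAfter(max)
}

val (min, max) = offsetvalue
val res = rdd.filter(x => TimeFilters.inWindow(min, max)(x.get[DateTime]("mytime")))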