Question

Background

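Here's my situation: I'm trying to create a class that filters RDDs based on some feature of their content, but that feature can differ between scenarios, so I'd like to parameterize the class with a function. Unfortunately, I seem to be running into issues with the way Scala captures its closures. Even though my function is serializable, the class is not.

Looking at the example in the Spark source on closure cleaning, it seems to suggest my situation can't be solved, but I'm convinced there is a way to achieve what I'm trying to do by creating the right (smaller) closure.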

My code

class MyFilter(getFeature: Element => String, other: NonSerializable) {
  def filter(rdd: RDD[Element]): RDD[Element] = {
    // All my complicated logic I want to share
    rdd.filter { elem => getFeature(elem) == "myTargetString" }
  }
}

Simplified example

class Foo(f: Int => Double, rdd: RDD[Int]) { 
  def go(data: RDD[Int]) = data.map(f) 
}

val works = new Foo(_.toDouble, otherRdd)
works.go(myRdd).collect() // works

val myMap = Map(1 -> 10d)
val complicatedButSerializableFunc: Int => Double = x => myMap.getOrElse(x, 0)
val doesntWork = new Foo(complicatedButSerializableFunc, otherRdd)
doesntWork.go(myRdd).collect() // craps out

org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: $iwC$$iwC$Foo
Serialization stack:
    - object not serializable (class: $iwC$$iwC$Foo, value: $iwC$$iwC$Foo@61e33118)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: foo, type: class $iwC$$iwC$Foo)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC@47d6a31a)
    - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, name: $outer, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)
    - object (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1, <function1>)

// Even though
val out = new ObjectOutputStream(new FileOutputStream("test.obj"))
out.writeObject(complicatedButSerializableFunc) // works

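Questions

Why does the first simplified example not attempt to serialize all of Foo, while the second one does?
How can I get a reference to my serializable function without including a reference to Foo in my closure?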

Answer 1

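Found the answer with the help of this article.

Basically, when creating a closure for a given function, Scala will include the entire object for any complex field referenced (if anyone has a good explanation for why this doesn't happen in the first simple example, I'll accept that answer). The solution is to pass the serializable value to a different function so that only the minimal reference is kept, very much like the good ol' JavaScript for-loop paradigm for event listeners.

Example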

def enclose[E, R](enclosed: E)(func: E => R): R = func(enclosed)

class Foo(f: Int => Double, somethingNonserializable: RDD[String]) {
  def go(data: RDD[Int]) = enclose(f) { actualFunction => data.map(actualFunction) }
}

Or, using a JS-style self-executing anonymous function

def go(data: RDD[Int]) = ((actualFunction: Int => Double) => data.map(actualFunction))(f)
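
The capture behaviour can be checked without a Spark cluster. The following is a minimal sketch in plain Scala using Java serialization; it mirrors MyFilter from the question but substitutes String for Element, and the names NonSerializable (as a stand-in class), ClosureDemo, isSerializable, capturing, minimal and localGetFeature are made up for illustration. It assumes that a lambda referencing a constructor parameter from inside a method holds a reference to the enclosing instance, while a lambda referencing only a local val holds just that val.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for any dependency that cannot be serialized.
class NonSerializable

// Same shape as MyFilter in the question; the class itself is not Serializable.
class MyFilter(getFeature: String => String, other: NonSerializable) {

  // Uses the constructor parameter directly, so the lambda needs `this`
  // and drags the whole MyFilter instance into the closure.
  def capturing: String => Boolean =
    elem => getFeature(elem) == "myTargetString"

  // Copies the function into a local val first, so the lambda only
  // holds the (serializable) function and not the MyFilter instance.
  def minimal: String => Boolean = {
    val localGetFeature = getFeature
    elem => localGetFeature(elem) == "myTargetString"
  }
}

object ClosureDemo {
  // Returns true if Java serialization accepts the object graph.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val filter = new MyFilter(s => s.trim, new NonSerializable)
    println(s"capturing closure serializable: ${isSerializable(filter.capturing)}")
    println(s"minimal closure serializable:   ${isSerializable(filter.minimal)}")
  }
}
```

Under those assumptions, the first check should report false and the second true. Copying the function into a local val is the same idea as the enclose helper and the self-executing function above: the closure handed to Spark then refers only to a local name.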
