Serialization error when using a non-serializable object in driver code

Asked: 2015-06-10 14:29:24

Tags: apache-spark

I am using Spark Streaming to process a stream by handling each partition (saving events to HBase), and then acking the last event in each RDD from the driver back to the receiver, so that the receiver can in turn ack it to its source.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;

public class StreamProcessor {

  final AckClient ackClient;

  public StreamProcessor(AckClient ackClient) {
    this.ackClient = ackClient;
  }

  public void process(final JavaReceiverInputDStream<Event> inputDStream) {
    inputDStream.foreachRDD(rdd -> {
      JavaRDD<Event> lastEvents = rdd.mapPartitions(events -> {
        // ------ this code executes on the worker -------
        // process events one by one; I don't use ackClient here
        // return the event with the max delivery tag here
      });
      // ------ this code executes on the driver -------
      Event lastEvent = .. // find event with max delivery tag across partitions
      ackClient.ack(lastEvent); // use ackClient to ack last event
    });
  }
}

The problem is that I get the following error (even though everything appears to work correctly):

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
    at org.apache.spark.api.java.JavaRDDLike$class.mapPartitions(JavaRDDLike.scala:141)
    at org.apache.spark.api.java.JavaRDD.mapPartitions(JavaRDD.scala:32)
...
Caused by: java.io.NotSerializableException: <some non-serializable object used by AckClient>
...

It seems that Spark is trying to serialize AckClient in order to ship it to the workers. However, I thought that only the code inside mapPartitions is serialized and shipped to the workers, and that the code at the RDD level (i.e., inside foreachRDD but outside mapPartitions) is not serialized or shipped to the workers.
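To make the capture mechanics concrete, here is a minimal sketch (the CaptureDemo class is hypothetical; Event and AckClient are the types from my code above). In Java, a lambda or anonymous class that reads an instance field compiles to a reference through this, so serializing that closure drags the entire enclosing object along:

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;

public class CaptureDemo implements Serializable {

  private final AckClient ackClient; // itself not serializable

  public CaptureDemo(AckClient ackClient) {
    this.ackClient = ackClient;
  }

  public void run(JavaRDD<Event> rdd) {
    // Touches no instance state: only the stateless lambda itself is
    // serialized and shipped to the workers. This succeeds.
    rdd.map(e -> e).count();

    // Reading 'ackClient' compiles to 'this.ackClient', so the closure
    // captures the whole CaptureDemo; the closure cleaner then fails on
    // the non-serializable AckClient field, producing a trace like the
    // one above.
    rdd.map(e -> { ackClient.ack(e); return e; }).count();
  }
}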

Can anyone confirm whether my understanding is correct? And if it is correct, should this be reported as a bug?

1 Answer:

Answer 0 (Score: 1)

You are right, and this was fixed in 1.1. However, if you look at the stack trace, the cleaner that is throwing is being invoked inside mapPartitions:

at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)

So the problem has to do with your mapPartitions. Make sure you are not accidentally capturing this (the enclosing instance), as that is a common issue.
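One common way to rule that out is to move the worker-side logic into a static method, so the function shipped to the workers holds no hidden reference to this. A minimal sketch, assuming the Spark 1.x Java API (where FlatMapFunction returns an Iterable and foreachRDD takes a Function<JavaRDD<T>, Void>) and a hypothetical Event.getDeliveryTag() accessor:

import java.util.Collections;
import java.util.Iterator;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;

public class StreamProcessor {

  final AckClient ackClient;

  public StreamProcessor(AckClient ackClient) {
    this.ackClient = ackClient;
  }

  public void process(final JavaReceiverInputDStream<Event> inputDStream) {
    inputDStream.foreachRDD(rdd -> {
      // A static method reference carries no hidden 'this', so Spark
      // serializes only the stateless function, never StreamProcessor
      // or the AckClient it holds.
      JavaRDD<Event> lastEvents = rdd.mapPartitions(StreamProcessor::maxDeliveryTag);

      // Driver-side: pick the event with the max delivery tag overall.
      Event lastEvent = null;
      for (Event e : lastEvents.collect()) {
        if (lastEvent == null || e.getDeliveryTag() > lastEvent.getDeliveryTag()) {
          lastEvent = e;
        }
      }
      if (lastEvent != null) {
        ackClient.ack(lastEvent); // runs on the driver; never serialized
      }
      return null; // pre-1.6 foreachRDD takes Function<JavaRDD<Event>, Void>
    });
  }

  // Runs on the workers; deliberately static and free of instance state.
  private static Iterable<Event> maxDeliveryTag(Iterator<Event> events) {
    Event max = null;
    while (events.hasNext()) {
      Event e = events.next();
      if (max == null || e.getDeliveryTag() > max.getDeliveryTag()) {
        max = e;
      }
    }
    return max == null
        ? Collections.<Event>emptyList()
        : Collections.singletonList(max);
  }
}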