I am using Spark Streaming to process a stream by handling each partition (saving the events to HBase), then acking the last event in each RDD from the driver back to the receiver, so that the receiver can in turn ack it to its source:
public class StreamProcessor {
    final AckClient ackClient;

    public StreamProcessor(AckClient ackClient) {
        this.ackClient = ackClient;
    }

    public void process(final JavaReceiverInputDStream<Event> inputDStream) {
        inputDStream.foreachRDD(rdd -> {
            JavaRDD<Event> lastEvents = rdd.mapPartitions(events -> {
                // ------ this code executes on the worker -------
                // process events one by one; I don't use ackClient here
                // return the event with the max delivery tag here
            });
            // ------ this code executes on the driver -------
            Event lastEvent = .. // find the event with the max delivery tag across partitions
            ackClient.ack(lastEvent); // use ackClient to ack the last event
        });
    }
}
The problem here is that I get the following error (even though everything appears to work correctly):
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
at org.apache.spark.api.java.JavaRDDLike$class.mapPartitions(JavaRDDLike.scala:141)
at org.apache.spark.api.java.JavaRDD.mapPartitions(JavaRDD.scala:32)
...
Caused by: java.io.NotSerializableException: <some non-serializable object used by AckClient>
...
It seems that Spark is trying to serialize AckClient in order to ship it to the workers, but I thought that only the code inside mapPartitions is serialized and shipped to the workers, and that the code at the RDD level (i.e., inside foreachRDD but outside mapPartitions) is not serialized or shipped to the workers.
Can someone confirm whether my understanding is correct? And if it is correct, should this be reported as a bug?
Answer 0 (score: 1)
You are right, and this was fixed in 1.1. However, if you look at the stack trace, the closure cleaner that is throwing is being invoked from the mapPartitions call:
at org.apache.spark.SparkContext.clean(SparkContext.scala:1435)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:602)
So the problem has to do with your mapPartitions. Make sure that you are not accidentally capturing this in that closure, as that is a common issue.
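To illustrate the pitfall: in Java 8, a lambda that reads an instance field or calls an instance method implicitly captures the enclosing instance, even if it never writes this explicitly, so the closure cleaner ends up trying to serialize the whole StreamProcessor, ackClient included. Below is a minimal sketch of the safe pattern, assuming the Spark 1.x Java API (where the function passed to mapPartitions returns an Iterable; in 2.x it returns an Iterator) and a hypothetical Event.getDeliveryTag() accessor that is not in the original post:

import java.util.Collections;
import java.util.List;

public void process(final JavaReceiverInputDStream<Event> inputDStream) {
    inputDStream.foreachRDD(rdd -> {
        // BROKEN: touching any instance member inside the closure, e.g.
        //   rdd.mapPartitions(events -> { ackClient.ack(...); ... });
        // makes the lambda capture `this`, which is what triggers the
        // NotSerializableException above.

        // SAFE: this closure uses only its parameter and local variables,
        // so nothing from the enclosing StreamProcessor is captured.
        JavaRDD<Event> lastEvents = rdd.mapPartitions(events -> {
            Event max = null;
            while (events.hasNext()) {
                Event e = events.next();
                // ... save e to HBase here ...
                if (max == null || e.getDeliveryTag() > max.getDeliveryTag()) {
                    max = e;
                }
            }
            return max == null
                    ? Collections.<Event>emptyList()
                    : Collections.singletonList(max);
        });

        // Driver-side code: instance members are fine here.
        List<Event> candidates = lastEvents.collect();
        // ... pick the candidate with the max delivery tag ...
        // ackClient.ack(lastEvent);
    });
}

If the worker-side code genuinely needs a value from the enclosing object, copy it into a local variable before the lambda; then only that value is serialized, rather than the whole enclosing instance.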