Writing Spark checkpoints to S3 is too slow

Time: 2016-05-02 17:10:59

Tags: amazon-s3 apache-spark

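I am using Spark Streaming 1.5.2 and pulling data from Kafka 0.8.2.2 with the Direct Stream approach.

I have enabled checkpointing so that my driver can be restarted and resume where it left off without losing unprocessed data.

The checkpoints are written to S3 because I am running on Amazon AWS, not on top of a Hadoop cluster.

The batch interval is 1 second because I want low latency.

The problem is that writing a single checkpoint to S3 takes anywhere from 1 to 20 seconds. The pending checkpoints back up in memory, and eventually the application fails.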

2016-04-28 18:26:55,483 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6071 bytes and 1724 ms
2016-04-28 18:26:58,812 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882407000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882407000', took 6024 bytes and 3329 ms
2016-04-28 18:27:00,327 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6068 bytes and 1515 ms
2016-04-28 18:27:06,667 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882408000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882408000', took 6024 bytes and 6340 ms
2016-04-28 18:27:11,689 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6067 bytes and 5022 ms
2016-04-28 18:27:15,982 INFO  [org.apache.spark.streaming.CheckpointWriter] [pool-16-thread-1] - Checkpoint for time 1461882409000 ms saved to file 's3a://.../checkpoints/cxp-filter/checkpoint-1461882409000', took 6024 bytes and 4293 ms

Is there a way to increase the interval between checkpoints without increasing the batch interval?

1 Answer:

Answer 0 (score: 0)

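Yes, you can achieve this with the checkpointInterval parameter. You can set the duration when you enable checkpointing on the stream, as described in the documentation quoted below.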

  

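Note that checkpointing of RDDs incurs the cost of saving to reliable storage. This may increase the processing time of those batches in which RDDs get checkpointed. Hence, the checkpointing interval needs to be set carefully. At small batch sizes (say 1 second), checkpointing every batch may significantly reduce operation throughput. Conversely, checkpointing too infrequently causes the lineage and task sizes to grow, which may have detrimental effects. For stateful transformations that require RDD checkpointing, the default interval is a multiple of the batch interval that is at least 10 seconds. It can be set by using dstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5-10 sliding intervals of a DStream is a good setting to try.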
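
To make the arithmetic in the quoted passage concrete, here is a minimal Python sketch. It assumes the default is the smallest multiple of the batch interval that reaches 10 seconds, which is one reading of the sentence above. The helper names (default_checkpoint_interval, suggested_checkpoint_range, apply_checkpoint_interval) are hypothetical and are not part of Spark; only dstream.checkpoint(...) comes from the quoted documentation. In PySpark the interval passed to it is given in seconds, whereas Scala code would pass a Duration such as Seconds(10).

```python
import math


def default_checkpoint_interval(batch_interval_s, minimum_s=10.0):
    """Smallest multiple of the batch interval that is at least minimum_s.

    Assumption: this mirrors the "multiple of the batch interval that is
    at least 10 seconds" rule from the quoted documentation.
    """
    multiples = math.ceil(minimum_s / batch_interval_s)
    return batch_interval_s * multiples


def suggested_checkpoint_range(slide_interval_s):
    """The 5-10 sliding-interval range from the quoted guidance, in seconds."""
    return 5 * slide_interval_s, 10 * slide_interval_s


def apply_checkpoint_interval(dstream, interval_s):
    """Hypothetical wrapper: forwards the interval to dstream.checkpoint()."""
    return dstream.checkpoint(interval_s)


if __name__ == "__main__":
    batch = 1  # seconds, the batch interval from the question
    print(default_checkpoint_interval(batch))
    print(suggested_checkpoint_range(batch))
```

With the 1-second batch interval from the question, both the assumed default and the upper end of the suggested range come out at 10 seconds, so apply_checkpoint_interval(stream, 10) would be a reasonable starting point to try.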