Why does checkpointing have such a large impact on latency?

Time: 2019-02-27 15:48:02

Tags: apache-flink flink-streaming

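I have found that enabling checkpointing with the memory state backend causes an unexpected increase in observed latency.

Consider the following checkpoint: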

2019-02-27 15:35:46,322 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 2 @ 1551281746322 for job a80597b3312f0704beed75397c371bf5.
2019-02-27 15:35:46,326 INFO  org.apache.flink.runtime.state.heap.HeapKeyedStateBackend     - Heap backend snapshot (In-Memory Stream Factory, synchronous part) in thread Thread[KeyedProcess -> Map -> Sink: Unnamed (1/1),5,Flink Task Threads] took 0 ms.
2019-02-27 15:35:46,342 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (In-Memory Stream Factory, synchronous part) in thread Thread[Async calls on Source: Custom Source -> Map -> Timestamps/Watermarks (1/1),5,Flink Task Threads] took 2 ms.
2019-02-27 15:35:46,346 INFO  org.apache.flink.runtime.state.DefaultOperatorStateBackend    - DefaultOperatorStateBackend snapshot (In-Memory Stream Factory, asynchronous part) in thread Thread[pool-14-thread-2,5,Flink Task Threads] took 3 ms.
2019-02-27 15:35:46,351 INFO  org.apache.flink.runtime.state.heap.HeapKeyedStateBackend     - Heap backend snapshot (In-Memory Stream Factory, asynchronous part) in thread Thread[pool-11-thread-2,5,Flink Task Threads] took 14 ms.
2019-02-27 15:35:46,378 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 2 for job a80597b3312f0704beed75397c371bf5 (1157653 bytes in 54 ms).

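Even though the end-to-end checkpoint duration is only about 50 ms, the response to an event injected at 15:35:46,385 arrived only at 15:35:46,905 (520 ms later). No events were processed between these two timestamps. Without checkpointing, the 99.99th percentile latency is about 15 ms.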

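Setup:

  • Parallelism = 1
  • Network buffer = 0
  • RMQ source -> window -> RMQ sink
  • The injector measures latency as the System.nanoTime difference between injecting an event and receiving its response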
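
For readers who want to reproduce the measurement, here is a minimal sketch of how an injector could take the System.nanoTime difference and report a percentile. The class LatencyProbe, its method names, and the numeric event id are hypothetical and are not taken from the original injector; the percentile uses the nearest-rank method, which is one of several common definitions.

```java
// Hypothetical helper, not the original injector: records the
// System.nanoTime difference between injecting an event and
// receiving its response. Not thread-safe as written.
class LatencyProbe {
    // Injection timestamps keyed by an assumed numeric event id.
    private final java.util.Map<Long, Long> injectedAt = new java.util.HashMap<>();
    private final java.util.List<Long> latenciesNanos = new java.util.ArrayList<>();

    // Call immediately before publishing the event.
    void onInject(long eventId) {
        injectedAt.put(eventId, System.nanoTime());
    }

    // Call when the response for the same event id arrives.
    void onResponse(long eventId) {
        Long start = injectedAt.remove(eventId);
        if (start != null) {
            latenciesNanos.add(System.nanoTime() - start);
        }
    }

    // Nearest-rank percentile over an ascending-sorted array, p in (0, 100].
    static long percentile(long[] sortedNanos, double p) {
        if (sortedNanos.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        int rank = (int) Math.ceil(p / 100.0 * sortedNanos.length);
        return sortedNanos[Math.max(rank, 1) - 1];
    }

    // Percentile of all latencies recorded so far, in nanoseconds.
    long percentileNanos(double p) {
        long[] sorted = latenciesNanos.stream()
                .mapToLong(Long::longValue)
                .sorted()
                .toArray();
        return percentile(sorted, p);
    }
}
```

System.nanoTime is only meaningful as a difference taken within one JVM, so this approach assumes the same process both injects the event and receives the response.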

Edit: this is a linear job, so I assume checkpoint barrier alignment does not come into play.

1 answer:

Answer 0 (score: 1)

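The time is spent in the synchronous acknowledgement of messages to RabbitMQ (MessageAcknowledgingSourceBase#notifyCheckpointComplete > MultipleIdsMessageAcknowledgingSourceBase#acknowledgeIDs > RMQSource#acknowledgeSessionIDs). This could probably be done asynchronously, as the Kafka connector does.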

Because my checkpoint interval is 3 minutes and I inject 200 ev/s, each checkpoint triggers the acknowledgement of 36k messages (200 * 60 * 3), which takes about 500 ms.

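Using a smaller checkpoint interval would probably make latency more predictable, since each checkpoint would acknowledge fewer messages, but it would increase the median latency.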
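
The arithmetic above can be sketched as a small Java program. It derives a per-acknowledgement cost from the observed figures (36k messages in about 500 ms) and then estimates the pause for other checkpoint intervals. The linear scaling of acknowledgement time with message count is an assumption made for illustration, not something measured here, and the class and method names are hypothetical.

```java
// Back-of-the-envelope estimate; assumes acknowledgement time grows
// linearly with the number of messages acknowledged per checkpoint.
class AckPauseEstimate {
    // Messages accumulated between two checkpoints.
    static long messagesPerCheckpoint(long eventsPerSecond, long intervalSeconds) {
        return eventsPerSecond * intervalSeconds;
    }

    // Average cost of one acknowledgement, in microseconds.
    static double perAckMicros(double pauseMillis, long messages) {
        return pauseMillis * 1000.0 / messages;
    }

    // Estimated pause for a given number of acknowledgements, in milliseconds.
    static double estimatedPauseMillis(long messages, double perAckMicros) {
        return messages * perAckMicros / 1000.0;
    }

    public static void main(String[] args) {
        long rate = 200; // ev/s, from the answer
        long observed = messagesPerCheckpoint(rate, 180); // 3-minute interval
        double perAck = perAckMicros(500.0, observed);    // about 500 ms observed

        // The 60 s and 10 s intervals are illustrative choices.
        for (long interval : new long[] {180, 60, 10}) {
            long n = messagesPerCheckpoint(rate, interval);
            System.out.printf("interval=%ds acks=%d estimated pause=%.1f ms%n",
                    interval, n, estimatedPauseMillis(n, perAck));
        }
    }
}
```

Under that linear assumption, one acknowledgement costs roughly 14 microseconds, and a 60-second interval would pause for roughly a third of the 500 ms seen with the 3-minute interval. Real numbers would need to be measured, since fixed per-checkpoint costs are ignored here.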