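I noticed that enabling checkpointing with the memory state backend causes an unexpected increase in observed latency.
Consider the following checkpoint: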
2019-02-27 15:35:46,322 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 2 @ 1551281746322 for job a80597b3312f0704beed75397c371bf5.
2019-02-27 15:35:46,326 INFO org.apache.flink.runtime.state.heap.HeapKeyedStateBackend - Heap backend snapshot (In-Memory Stream Factory, synchronous part) in thread Thread[KeyedProcess -> Map -> Sink: Unnamed (1/1),5,Flink Task Threads] took 0 ms.
2019-02-27 15:35:46,342 INFO org.apache.flink.runtime.state.DefaultOperatorStateBackend - DefaultOperatorStateBackend snapshot (In-Memory Stream Factory, synchronous part) in thread Thread[Async calls on Source: Custom Source -> Map -> Timestamps/Watermarks (1/1),5,Flink Task Threads] took 2 ms.
2019-02-27 15:35:46,346 INFO org.apache.flink.runtime.state.DefaultOperatorStateBackend - DefaultOperatorStateBackend snapshot (In-Memory Stream Factory, asynchronous part) in thread Thread[pool-14-thread-2,5,Flink Task Threads] took 3 ms.
2019-02-27 15:35:46,351 INFO org.apache.flink.runtime.state.heap.HeapKeyedStateBackend - Heap backend snapshot (In-Memory Stream Factory, asynchronous part) in thread Thread[pool-11-thread-2,5,Flink Task Threads] took 14 ms.
2019-02-27 15:35:46,378 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 2 for job a80597b3312f0704beed75397c371bf5 (1157653 bytes in 54 ms).
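Although the checkpoint's end-to-end duration is only about 50 ms, the response to an event injected at 15:35:46,385 did not arrive until 15:35:46,905 (520 ms later). No events were processed between those two timestamps. Without checkpointing, the 99.99th-percentile latency is about 15 ms.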
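Setup:
Latency is measured as the difference between System.nanoTime readings.
Edit: this is a linear job, so I assume no checkpoint barrier alignment takes place.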
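The latency measurement described in the setup can be sketched roughly as follows. This is only an illustration, not the actual test harness: the class name LatencyProbe, the nearest-rank percentile helper, and the Thread.sleep stand-in for the round trip through the job are all assumptions. Note that System.nanoTime readings are only comparable within a single JVM, so both timestamps must be taken in the same process.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Flink or of the original test setup.
public class LatencyProbe {

    // Nearest-rank percentile over recorded latencies (nanoseconds).
    static long percentile(long[] samples, double p) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) throws InterruptedException {
        long[] latencies = new long[100];
        for (int i = 0; i < latencies.length; i++) {
            long sentAt = System.nanoTime();            // taken when the event is injected
            Thread.sleep(1);                            // stand-in for the round trip through the job
            latencies[i] = System.nanoTime() - sentAt;  // taken when the response arrives
        }
        System.out.printf("p99.99 = %.3f ms%n", percentile(latencies, 99.99) / 1e6);
    }
}
```

With only a few hundred samples, the 99.99th percentile is effectively the maximum, so a single stalled event such as the 520 ms one above dominates the figure.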
Answer 0 (score: 1)
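The time is spent in the synchronous acknowledgement of messages to RabbitMQ (MessageAcknowledgingSourceBase#notifyCheckpointComplete
> MultipleIdsMessageAcknowledgingSourceBase#acknowledgeIDs
> RMQSource#acknowledgeSessionIDs
). This could probably be done asynchronously, as the Kafka connector does.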
My checkpoint interval is 3 minutes and I inject 200 events/s, so each checkpoint triggers the acknowledgement of 36,000 messages (200 * 60 * 3), which takes about 500 ms.
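A shorter checkpoint interval would likely make latency more predictable, since each checkpoint would acknowledge fewer messages, but it would raise the median latency.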
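To make the arithmetic concrete, here is a small sketch of how the acknowledgement batch size depends on the checkpoint interval. The class and method names are hypothetical, and the pause estimate assumes that acknowledgement time scales linearly with batch size, calibrated on the single observation above (about 500 ms for 36,000 messages). That linearity is an assumption, not something measured.

```java
// Hypothetical back-of-the-envelope helper, not part of Flink.
public class AckBatchEstimate {

    // Messages acknowledged per checkpoint = injection rate * checkpoint interval.
    static long messagesPerCheckpoint(long eventsPerSecond, long intervalSeconds) {
        return eventsPerSecond * intervalSeconds;
    }

    // ASSUMPTION: ack time grows linearly with batch size,
    // calibrated on the observed ~500 ms for 36,000 messages.
    static double estimatedAckPauseMs(long messages) {
        return messages * (500.0 / 36_000.0);
    }

    public static void main(String[] args) {
        long rate = 200; // events per second, as in the post
        for (long interval : new long[] {180, 60, 10}) {
            long n = messagesPerCheckpoint(rate, interval);
            System.out.printf("interval=%ds -> %d messages, ~%.0f ms pause%n",
                    interval, n, estimatedAckPauseMs(n));
        }
    }
}
```

Under that assumption, a shorter interval trades one long pause for more frequent short ones, and it does not account for the fixed per-checkpoint overhead that raises the median latency.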