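I have a Flink job that gets stuck while creating checkpoints. It has almost no state (apart from some Kafka offsets).
The job has the following basic setup:
KafkaSource -> iterate -> HDFSSink
The iterate function in turn makes an HTTP call, forwards successful responses, drops 4xx responses, and retries 5xx responses.
From my metrics I can see that all of this is happening: I get some 5xx responses (sent back to the iteration source), some 4xx responses (ignored), and many 2xx responses (forwarded to HDFS).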
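To make the routing rule concrete, here is a minimal sketch of the per-response decision in plain Java. It is not the actual job code: the names `HttpOutcomeDemo`, `Action`, and `classify` are hypothetical, and rejecting 1xx/3xx codes is an assumption, since the description above does not say how they are handled.

```java
class HttpOutcomeDemo {

    // What the iterate step does with a response (hypothetical names).
    enum Action { FORWARD, DROP, RETRY }

    // Maps an HTTP status code to the action described in the question.
    static Action classify(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) {
            return Action.FORWARD; // 2xx: forwarded to the HDFS sink
        }
        if (statusCode >= 400 && statusCode < 500) {
            return Action.DROP;    // 4xx: ignored
        }
        if (statusCode >= 500 && statusCode < 600) {
            return Action.RETRY;   // 5xx: fed back into the iteration
        }
        // Assumption: other codes are not covered by the description.
        throw new IllegalArgumentException("Unhandled status code: " + statusCode);
    }

    public static void main(String[] args) {
        for (int code : new int[] {200, 404, 503}) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

Only the `RETRY` branch sends records back to the iteration source, so the share of 5xx responses determines how much traffic the feedback edge carries.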
Looking at a thread dump, I can see that one task is blocked:
"Async calls on IterationSource-8 (1/1)" #123 daemon prio=5 os_prio=0 tid=0x00007f174000f800 nid=0x237 waiting for monitor entry [0x00007f17b32f5000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:747)
        - waiting to lock <0x00000000ace0f128> (a java.lang.Object)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:683)
        at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1155)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
This thread is waiting for the object monitor held by the following thread:
"IterationSource-8 (1/1)" #63 prio=5 os_prio=0 tid=0x00007f17c00bf000 nid=0x1e0 in Object.wait() [0x00007f17b17d2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestMemorySegment(LocalBufferPool.java:256)
        - locked <0x00000000acd030b0> (a java.util.ArrayDeque)
        at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBuilderBlocking(LocalBufferPool.java:213)
        at org.apache.flink.runtime.io.network.partition.ResultPartition.getBufferBuilder(ResultPartition.java:181)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter.requestNewBufferBuilder(RecordWriter.java:256)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:184)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:154)
        at org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:120)
        at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:107)
        at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
        at org.apache.flink.streaming.runtime.tasks.StreamIterationHead.performDefaultAction(StreamIterationHead.java:77)
        - locked <0x00000000ace0f128> (a java.lang.Object)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:298)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:403)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
        at java.lang.Thread.run(Thread.java:748)
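Looking more closely at the source code, I can see that the second thread (the one holding the lock) seems to be in some kind of infinite loop: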
LocalBufferPool.java:
while (availableMemorySegments.isEmpty()) {
}
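The two stack traces can be reproduced in miniature with plain Java threads, which may help in reading them. The sketch below is not Flink code and makes several assumptions: `checkpointLock` stands in for the `java.lang.Object` monitor `<0x00000000ace0f128>`, `availableSegments` stands in for the `ArrayDeque` in `LocalBufferPool`, the queue is never refilled (as if downstream never returned a buffer), and the 2000 ms wait timeout is arbitrary. All class and variable names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CountDownLatch;

class CheckpointLockDemo {

    /** Returns {state of the emitting thread, state of the checkpoint thread}. */
    static Thread.State[] observeStates() {
        final Object checkpointLock = new Object();
        final ArrayDeque<byte[]> availableSegments = new ArrayDeque<>(); // never refilled
        final CountDownLatch lockHeld = new CountDownLatch(1);

        // Plays the role of "IterationSource-8": emits while holding the checkpoint lock.
        Thread emitter = new Thread(() -> {
            synchronized (checkpointLock) {
                lockHeld.countDown();
                synchronized (availableSegments) {
                    while (availableSegments.isEmpty()) {
                        try {
                            // wait() releases availableSegments only, NOT checkpointLock.
                            availableSegments.wait(2000);
                        } catch (InterruptedException e) {
                            return; // lets the demo shut down
                        }
                    }
                }
            }
        }, "emitter");

        // Plays the role of "Async calls on IterationSource-8": needs the same lock.
        Thread checkpointer = new Thread(() -> {
            synchronized (checkpointLock) {
                // A real task would snapshot its state here.
            }
        }, "checkpointer");

        emitter.setDaemon(true);
        checkpointer.setDaemon(true);
        try {
            emitter.start();
            lockHeld.await(); // make sure the emitter owns the lock first
            checkpointer.start();
            long deadline = System.nanoTime() + 5_000_000_000L;
            Thread.State emitterState;
            Thread.State checkpointerState;
            do { // poll until both threads have settled, or give up after 5 s
                Thread.sleep(10);
                emitterState = emitter.getState();
                checkpointerState = checkpointer.getState();
            } while ((emitterState != Thread.State.TIMED_WAITING
                    || checkpointerState != Thread.State.BLOCKED)
                    && System.nanoTime() < deadline);
            return new Thread.State[] {emitterState, checkpointerState};
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            emitter.interrupt(); // release both monitors so the threads can exit
        }
    }

    public static void main(String[] args) {
        Thread.State[] states = observeStates();
        System.out.println("emitter: " + states[0] + ", checkpointer: " + states[1]);
    }
}
```

Under these assumptions the emitter should show the same `TIMED_WAITING` state as the dump and the checkpointer the same `BLOCKED` state. The loop is therefore not spinning: it waits for a free buffer while still holding the outer lock, so the checkpoint can only start once a buffer becomes available.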
Dear Flink experts, do you have any hints on which metrics I should look at? I am using Flink 1.9.0.
Thanks in advance for any hints!
Answer 0: (score: 0)
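I ran into similar checkpointing problems when making HTTP calls in a Flink sink. After a lot of trial and error, I found that checkpointing suffers whenever the sink's throughput per second is lower than the input rate.
To address this, I set the parallelism of the source (input) to 1 and the parallelism of the HTTP calls to 8.
This keeps threads from being blocked while waiting for HTTP responses, so checkpoints can happen. I am also new to Flink and would appreciate an expert explaining why checkpointing slows down when HTTP calls are made in Flink.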
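As a rough way to pick the parallelism of a blocking HTTP step, the number of calls in flight is about the input rate multiplied by the latency per call (Little's law). The sketch below only does that arithmetic. The class and method names are hypothetical, and the figures of 100 records per second and 80 ms per call are made-up inputs, not measurements from either job.

```java
class HttpParallelismEstimate {

    // Minimum number of parallel blocking callers needed to keep up with the
    // input rate, rounded up. Uses integer arithmetic to avoid rounding surprises.
    static long requiredParallelism(long recordsPerSecond, long latencyMillis) {
        if (recordsPerSecond < 0 || latencyMillis < 0) {
            throw new IllegalArgumentException("rate and latency must be non-negative");
        }
        // ceil(recordsPerSecond * latencyMillis / 1000)
        return (recordsPerSecond * latencyMillis + 999) / 1000;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: 100 records/s, 80 ms per blocking HTTP call.
        System.out.println(requiredParallelism(100, 80));
    }
}
```

If the configured parallelism is below this estimate, the HTTP step cannot keep up, upstream buffers fill, and the backpressure that appears to stall checkpoints in the question builds up.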