Kafka Streams timeouts on a Kafka cluster under high input rate

Asked: 2018-07-24 05:28:35

Tags: apache-kafka apache-kafka-streams

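I am running a Kafka cluster with 7 nodes and a lot of stream processing. I now occasionally see errors like the following in my Kafka Streams applications, for example at high input rates: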

[2018-07-23 14:44:24,351] ERROR task [0_5] Error sending record to topic topic-name. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl) org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append

[2018-07-23 14:44:31,021] ERROR stream-thread [StreamThread-2] Failed to commit StreamTask 0_5 state: (org.apache.kafka.streams.processor.internals.StreamThread) org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76) at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188) at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281) at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807) at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794) at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769) at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append

[2018-07-23 14:44:31,033] ERROR stream-thread [StreamThread-2] Failed while executing StreamTask 0_5 due to flush state: (org.apache.kafka.streams.processor.internals.StreamThread) org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:423) at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555) at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501) at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551) at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449) at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append

[2018-07-23 14:44:31,039] WARN stream-thread [StreamThread-2] Unexpected state transition from RUNNING to NOT_RUNNING. (org.apache.kafka.streams.processor.internals.StreamThread) Exception in thread "StreamThread-2" org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76) at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188) at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281) at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807) at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794) at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769) at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append

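If I lower the input rate (from 20k events/s to 10k events/s), the errors go away. So I am obviously hitting some limit. I have played around with different options (request.timeout.ms, linger.ms and batch.size), but I get the same result every time.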
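For reference, here is a minimal sketch of how these three producer settings can be handed to a Kafka Streams application. It uses only java.util.Properties with plain string keys so that it runs without the Kafka libraries. The "producer." prefix is the one Kafka Streams uses to route a setting to its internal producer. The application id, the broker address and all numeric values are hypothetical placeholders, not recommendations.

```java
import java.util.Properties;

final class StreamsProducerTuning {
    // Kafka Streams forwards keys that start with this prefix to its internal producer.
    static final String PRODUCER_PREFIX = "producer.";

    static Properties tunedConfig() {
        Properties props = new Properties();
        // Hypothetical identifiers, replace with your own.
        props.setProperty("application.id", "example-streams-app");
        props.setProperty("bootstrap.servers", "broker1:9092");
        // The three options mentioned in the question; the values are illustrative only.
        props.setProperty(PRODUCER_PREFIX + "request.timeout.ms", "120000");
        props.setProperty(PRODUCER_PREFIX + "linger.ms", "50");
        props.setProperty(PRODUCER_PREFIX + "batch.size", "65536");
        return props;
    }

    public static void main(String[] args) {
        // Print the effective settings so they can be checked at startup.
        tunedConfig().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real application the resulting Properties object would be passed to the KafkaStreams constructor. Depending on the client version, batch expiry may be governed by a different setting than request.timeout.ms, so it is worth checking the producer documentation for the version in use.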

2 Answers:

Answer 0: (score: 0)

You do seem to have hit some kind of limit. Based on the error message, I assume this is thread starvation caused by the high load, so the disks would be the first thing to check:

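  • Disk usage - if you are hitting the write speed limit, switching from HDD to SSD may help
  • Load distribution - is your traffic spread evenly across all nodes?
  • CPU load - heavy processing can eat it up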
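One rough way to put a number on the load distribution question is to compare the busiest partition with the average. The sketch below assumes you have already collected a per-partition message rate from your monitoring. How those numbers are obtained is not shown, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class PartitionSkew {
    /**
     * Ratio of the busiest partition's rate to the mean rate.
     * 1.0 means perfectly even; larger values mean more skew.
     */
    static double skewRatio(long[] ratesPerPartition) {
        if (ratesPerPartition.length == 0) {
            throw new IllegalArgumentException("need at least one partition");
        }
        long max = Arrays.stream(ratesPerPartition).max().getAsLong();
        double mean = Arrays.stream(ratesPerPartition).average().getAsDouble();
        // An idle topic has no skew to speak of.
        return mean == 0.0 ? 1.0 : max / mean;
    }

    public static void main(String[] args) {
        // Hypothetical per-partition rates in events/s.
        long[] rates = {2500, 2600, 2400, 9000};
        System.out.println("skew ratio: " + skewRatio(rates));
    }
}
```

A ratio well above 1 suggests that a few partitions, and therefore a few brokers and stream tasks, carry most of the traffic, which usually points at the key distribution rather than at the hardware.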

Answer 1: (score: 0)

We had a similar problem. In our case we had the following configuration for replication and acknowledgement:

replication.factor: 3
producer.acks: all

Under high load, the same error occurred many times: TimeoutException: Expiring N record(s) for topic: N ms has passed since last append

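After removing our custom replication.factor and producer.acks settings (so that we now use the defaults), the error disappeared. With acks set to all, a record is acknowledged only after the leader has received confirmation from the full set of in-sync replicas, so each record spends more time on the producer side until it has been replicated according to the specified replication.factor. With the defaults you are slightly less protected in terms of fault tolerance.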
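To make the two variants concrete, here is a small sketch that builds the strict settings from this answer, or no overrides at all. It uses plain string keys with java.util.Properties so it runs without the Kafka libraries; the class and method names are hypothetical. Which defaults apply when the overrides are removed depends on the Kafka version, so check the documentation for your release.

```java
import java.util.Properties;

final class DurabilitySettings {
    /**
     * strict = true  : the settings from this answer (more durable, slower acknowledgements).
     * strict = false : no overrides, so the client defaults apply.
     */
    static Properties build(boolean strict) {
        Properties props = new Properties();
        if (strict) {
            // Replication factor for the internal topics created by Kafka Streams.
            props.setProperty("replication.factor", "3");
            // The producer waits for all in-sync replicas before a send counts as done.
            props.setProperty("producer.acks", "all");
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println("strict:   " + build(true));
        System.out.println("defaults: " + build(false));
    }
}
```

If the durability of acks=all is required, an alternative to dropping it is to keep it and give the brokers more headroom, for example with more partitions or faster disks, at the cost of higher latency per record.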

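You could also consider increasing the number of partitions per topic and the number of application nodes (where your Kafka Streams logic runs).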
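As a back-of-the-envelope aid for the partition count, the sketch below divides the total input rate by the number of partitions and, the other way round, estimates how many partitions are needed for a given sustainable per-partition rate. The 20k events/s figure comes from the question. The per-partition figure is a hypothetical number that you would have to measure on your own cluster, and the calculation assumes evenly distributed keys.

```java
final class PartitionSizing {
    /** Average rate each partition sees if keys are evenly distributed. */
    static double perPartitionRate(double totalEventsPerSec, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        return totalEventsPerSec / partitions;
    }

    /** Smallest partition count that keeps every partition at or below the given rate. */
    static int minPartitions(double totalEventsPerSec, double sustainablePerPartition) {
        if (sustainablePerPartition <= 0) {
            throw new IllegalArgumentException("per-partition rate must be positive");
        }
        return (int) Math.ceil(totalEventsPerSec / sustainablePerPartition);
    }

    public static void main(String[] args) {
        double total = 20_000; // events/s, from the question
        System.out.println("rate per partition with 8 partitions: " + perPartitionRate(total, 8));
        // 3000 events/s per partition is a hypothetical measured limit.
        System.out.println("partitions needed: " + minPartitions(total, 3_000));
    }
}
```

Kafka Streams creates its tasks based on the partitions of the input topics, so adding application nodes or threads only helps up to the number of input partitions.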