Question

我们生产的Storm集群中出现的一个问题，我们无法理解或解决这个问题。

在某些时候，似乎kafka鲸鱼喷水停止从一半的主题分区读取。有40个分区，它只读取其中的20个。我们无法在发生这种情况时找到我们对风暴集群或kafka所做的任何更改。

我们更改了使用者组ID并将spout config startOffsetTime设置为OffsetRequest.LatestTime，以尝试从所有分区读取新数据。它仍然只连接到相同的20个分区。我们已经查看了Storm zookeeper中的节点/<topic>/<consumer_group>，并且只看到了20个分区。

我们已经验证消息正在发布到所有40个分区。

Kafka版本为0.9.0.1，风暴版本为1.1.0。

非常感谢有关如何调试或在哪里观看的任何提示。我是否提到这是在生产中发生的？我是否在一周前开始提到它，我们今天早上才注意到了？：（

其他信息：我们在Kafka状态更改日志中发现了一些错误（分区9是受影响的分区之一，日志中的时间戳看起来与问题开始的时间有关）< / p>

kafka.common.NoReplicaOnlineException: No replica for partition 
[transcription-results,9] is alive. Live brokers are: [Set()], Assigned replicas are: [List(1, 4, 0)]
[2018-03-14 03:11:40,863] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OnlineReplica to OfflineReplica (state.change.logger)
[2018-03-14 03:11:41,141] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,145] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 0 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,208] TRACE Controller 0 epoch 44 changed state of replica 4 for partition [transcription-results,9] from OnlineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,218] TRACE Controller 0 epoch 44 changed state of replica 1 for partition [transcription-results,9] from OfflineReplica to OnlineReplica (state.change.logger)
[2018-03-14 03:11:41,226] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 4 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,230] TRACE Controller 0 epoch 44 sending become-follower LeaderAndIsr request (Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44) to broker 1 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,450] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:0,4,LeaderEpoch:442,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,454] TRACE Broker 0 handling LeaderAndIsr request correlationId 158 from controller 0 epoch 44 starting the become-follower transition for partition [transcription-results,9] (state.change.logger)
[2018-03-14 03:11:41,455] ERROR Broker 0 received LeaderAndIsrRequest with correlation id 158 from controller 0 epoch 44 for partition [transcription-results,9] but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
//... removed some TRACE statements here
[2018-03-14 03:11:41,908] WARN Broker 0 ignoring LeaderAndIsr request from controller 1 with correlation id 1 epoch 47 for partition [transcription-results,9] since its associated leader epoch 441 is old. Current leader epoch is 441 (state.change.logger)
[2018-03-14 03:11:41,982] TRACE Broker 0 cached leader info (LeaderAndIsrInfo:Leader:1,ISR:0,1,4,LeaderEpoch:441,ControllerEpoch:44),ReplicationFactor:3),AllReplicas:1,4,0) for partition [transcription-results,9] in response to UpdateMetadata request sent by controller 1 epoch 47 with correlation id 2 (state.change.logger)
[2018-03-22 14:43:36,098] TRACE Broker 0 received LeaderAndIsr request (LeaderAndIsrInfo:Leader:-1,ISR:,LeaderEpoch:444,ControllerEpoch:47),ReplicationFactor:3),AllReplicas:1,4,0) correlation id 679 from controller 1 epoch 47 for partition [transcription-results,9] (state.change.logger)

可能是由此错误引起的：https://issues.apache.org/jira/browse/KAFKA-3963

但我们怎样才能从中恢复过来？

Answer 1

我首先查看Kafka在/ broker / topics下的Zookeeper，以验证是否列出了所有分区。这就是storm-kafka从中读取分区的地方。

Apache风暴kafka spout只读取主题分区的一半

1 个答案: