Question

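I am seeing high CPU usage with deactivated Apache Storm topologies. I can reliably reproduce the problem with the steps below, but I have not found the exact cause or a fix.

The environment is a Storm cluster running one topology (a very simple one; I used the exclamation example). The topology is INACTIVE. CPU usage is normal at first. However, when I kill all topology JVM processes on all supervisors and let Storm restart them, I find that after some time (about 9 hours) the CPU usage of each JVM process climbs to almost 100%. I have tested this with an ACTIVE topology, and it does not happen. I have also tested with multiple topologies and observed the same result when they are INACTIVE.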

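Steps to reproduce:

1. Run one topology on an Apache Storm cluster.
2. Deactivate it.
3. Kill all topology JVM processes on all supervisors (Storm will restart them).
4. Observe CPU usage on the supervisors climb to nearly 100% for all INACTIVE topology JVM processes.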

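Environment

Apache Storm 1.1.0 running on 3 VMs: 1 nimbus and 2 supervisors.

Cluster summary:

Supervisors: 2
Used slots: 2
Free slots: 38
Total slots: 40
Executors: 50
Tasks: 50

The topology has 2 workers and 50 executors/tasks (threads).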

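Investigation:

Besides being able to reliably reproduce the problem, I have identified the threads that use the most CPU in an affected topology JVM process. The process has 102 threads in total: 97 BLOCKED and 5 IN_NATIVE. The threads using the most CPU all have the same stack, and there are 23 of them (all in the BLOCKED state):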

Thread 28558: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 (Compiled frame)
 - com.lmax.disruptor.MultiProducerSequencer.next(int) @bci=82, line=136 (Compiled frame)
 - com.lmax.disruptor.RingBuffer.next(int) @bci=5, line=260 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue.publishDirect(java.util.ArrayList, boolean) @bci=18, line=517 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue.access$1000(org.apache.storm.utils.DisruptorQueue, java.util.ArrayList, boolean) @bci=3, line=61 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue$ThreadLocalBatcher.flush(boolean) @bci=50, line=280 (Interpreted frame)
 - org.apache.storm.utils.DisruptorQueue$Flusher.run() @bci=55, line=303 (Interpreted frame)
 - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 (Compiled frame)
 - java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1142 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=617 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)

I identified this thread by using jstack to take a thread dump of the process:

jstack -F <pid> > jstack-<pid>.txt

and top to find the threads in the process that use the most CPU:

top -H -p <pid>
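
The per-state tally quoted above (97 BLOCKED, 5 IN_NATIVE) can be recomputed from the dump file. The following is a minimal sketch, assuming the header format `Thread <id>: (state = <STATE>)` that appears in the trace above; the class name `JstackStateCounter` is hypothetical and not part of Storm or the JDK.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: tallies thread states in the text of a `jstack -F` dump.
final class JstackStateCounter {
    // Matches header lines such as "Thread 28558: (state = BLOCKED)".
    private static final Pattern HEADER =
            Pattern.compile("^Thread (\\d+): \\(state = (\\w+)\\)");

    /** Returns a map from state name to the number of threads in that state. */
    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dump.split("\\R")) {
            Matcher m = HEADER.matcher(line.trim());
            if (m.find()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java JstackStateCounter <dump-file>");
            return;
        }
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(countStates(dump));
    }
}
```

Run it against the file written by the jstack command above to get one count per state.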

Has anyone run into this or a similar problem before? Any help would be greatly appreciated.

Answer 1

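The problem occurs because the RingBuffer in the DisruptorQueue fills up, and when the publishing threads try to claim a slot they are effectively stuck in LockSupport.parkNanos(1L). I described this in my comment on the Storm JIRA.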
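
To illustrate why a full ring buffer turns into CPU load, here is a simplified sketch of a producer that parks for 1 ns each time it finds no free slot. It is not the Disruptor or Storm source; the class `FullRingSpinSketch`, its fields, and the `maxParks` cut-off are assumptions made for illustration. With no consumer draining the buffer, every `parkNanos(1L)` call returns almost immediately, so the thread keeps waking and retrying instead of sleeping.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Hypothetical, simplified model of a bounded ring with a parking producer.
final class FullRingSpinSketch {
    private final int capacity;
    private final AtomicLong published = new AtomicLong(); // slots claimed by producers
    private final AtomicLong consumed = new AtomicLong();  // slots freed by the consumer

    FullRingSpinSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Consumer side: frees one slot. */
    void consumeOne() {
        consumed.incrementAndGet();
    }

    /**
     * Producer side: claims one slot, parking for 1 ns whenever the ring is full.
     * Returns the number of park calls made before the claim succeeded, or -1 if
     * maxParks was reached first. The cut-off exists only so this sketch
     * terminates; an unbounded claim loop would keep retrying.
     */
    long claim(long maxParks) {
        long parks = 0;
        while (true) {
            long p = published.get();
            if (p - consumed.get() >= capacity) {
                if (parks >= maxParks) {
                    return -1;
                }
                LockSupport.parkNanos(1L); // returns almost immediately
                parks++;
                continue;
            }
            if (published.compareAndSet(p, p + 1)) {
                return parks;
            }
        }
    }

    public static void main(String[] args) {
        FullRingSpinSketch ring = new FullRingSpinSketch(2);
        ring.claim(0);
        ring.claim(0); // the ring is now full and nothing consumes
        long start = System.nanoTime();
        long result = ring.claim(10_000); // gives up after 10,000 parks
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("result=" + result + ", elapsed=" + elapsedMicros + " us");
    }
}
```

In the sketch the producer gives up after `maxParks` attempts. The stack trace in the question shows the flusher threads inside `MultiProducerSequencer.next`, which is consistent with them retrying in this way for as long as the buffer stays full.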
