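Spout is not getting acked

Time: 2014-06-21 14:52:55

Tags: apache-storm

In one of our topologies, which has 1 spout and 1 bolt, I had a hunch that the bolt was completing normally (and in time) but the spout was still getting fails.

I tried to confirm this with a TaskHook like the following -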

public class BaseHook extends BaseTaskHook {

    private Logger logger;
    private String topology;
    private String component;

    public BaseHook(String component) {
        this.component = component;
    }

    @Override
    public void prepare(Map conf, TopologyContext context) {
        logger = LoggerFactory.getLogger(this.getClass());
        this.topology = (String) conf.get("topology.name");
    }

    @Override
    public void emit(EmitInfo info) {
        log("EMITTED >> Value = " + info.values);
    }

    @Override
    public void spoutAck(SpoutAckInfo info) {
        log("ACKED >> Tuple = " + info.messageId + ", Latency = " + info.completeLatencyMs);
    }

    @Override
    public void spoutFail(SpoutFailInfo info) {
        log("FAILED >> Tuple = " + info.messageId + ", Latency = " + info.failLatencyMs);
    }

    @Override
    public void boltExecute(BoltExecuteInfo info) {
        log("EXECUTED >> Tuple = " + info.tuple.getValues() + ", Latency = " + info.executeLatencyMs);
    }

    @Override
    public void boltAck(BoltAckInfo info) {
        log("ACKED >> Tuple = " + info.tuple.getValues() + ", Latency = " + info.processLatencyMs);
    }

    @Override
    public void boltFail(BoltFailInfo info) {
        log("FAILED >> Tuple = " + info.tuple.getValues() + ", Latency = " + info.failLatencyMs);
    }

    private void log(String msg) {
        logger.info(">>>>> " + topology + " >> " + component + " >> " + msg);
    }
}

It turns out my hunch was right. The logs look like this -

>>>>> TopologyX >> SpoutX >> EMITTED >> Value = [XXXXXXXXX]
>>>>> TopologyX >> BoltX >> ACKED >> Tuple = [XXXXXXXXX], Latency = 1972
>>>>> TopologyX >> BoltX >> EXECUTED >> Tuple = [XXXXXXXXX], Latency = 1973
>>>>> TopologyX >> SpoutX >> FAILED >> Tuple = XXXXXXXXX, Latency = 53913

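That is, the bolt took almost 2 s (to execute and ack), but the spout's fail was called after about 53 s (almost twice topology.message.timeout.secs).

I would expect the spout's ack to be called within 2-3 seconds as well. The spout is non-blocking, and the spout and the bolt both have enough capacity.

Does anyone have a hint as to what the cause might be?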


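Update

So this is what the Storm cluster looks like -

  • 4 topologies
    • T1 = S > B > B > B > Ack/Fail
    • T2 = S > B > Ack/Fail
    • T3 = S > B > B > Ack/Fail
    • T4 =
      • S > B > Ack/Fail
      • S > B > Ack/Fail

So the problematic topology is T4, i.e. the one with 2 different spouts and 2 bolts. One of the flows usually works fine (they have different messageIds that uniquely identify the tuples).

Could this be the problem?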

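Anyway,

  • We tried reducing the executors as much as possible, but that did not improve anything in T4.
  • We disabled all the other topologies, and T4 worked perfectly fine
  • We enabled T1, and it still worked fine
  • We enabled T2 (and on other occasions T3), and T4 started failing

Now,

  • On a random occasion, T4 would even work alongside T1 and T3.
  • But otherwise, T4 breaks whenever T2 or T3 is enabled.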

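Points to note -

  • T3 and T4 are both fast topologies, i.e. their flows complete in < 100 ms
  • T3 and T4 have only two executors for each spout and bolt
  • T3 and T4 both have Max Tuple Pending = 1
  • We want to rate limit T3 and T4 (but have also tried without rate limiting)
    • Attempt 1: no rate limiting at all
    • Attempt 2: sleep 50 ms before emitting
    • Attempt 3: sleep 50 ms after emitting
    • Attempt 4: don't sleep, but emit only if 50 seconds have passed since the last emit
    • Nothing worked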

Additional information based on the comments

All spouts extend a BaseSpout class -

public abstract class BaseSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        context.addTaskHook(new BaseHook(this.getClass().getSimpleName()));
        try {
            this.collector = collector;
            open();
        } catch (Exception e) {
            throw new RuntimeException("Error when preparing spout", e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            getTuple();
        } catch (Throwable t) {
            if (!(t instanceof FailedException)) {
                t = new FailedException("nextTuple()", t);
            }
            collector.reportError(t);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        String[] fields = getFields();
        if (fields != null) {
            declarer.declare(new Fields(fields));
        }
    }

    protected void emit(Values values, String msgId) {
        collector.emit(values, msgId);
    }

    protected abstract void open() throws Exception;

    protected abstract void getTuple() throws Exception;

    protected abstract String[] getFields();
}

And all bolts extend a BaseBolt class -

public abstract class BaseBolt extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {   
        context.addTaskHook(new BaseHook(this.getClass().getSimpleName()));
        try {
            this.collector = collector;
            prepare();
        } catch (Exception e) {
            throw new RuntimeException("Error when preparing bolt", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            process(tuple);
            collector.ack(tuple);
        } catch (Throwable t) {
            if (!(t instanceof FailedException)) {
                t = new FailedException("execute(" + tuple + ")", t);
            }
            collector.reportError(t);
            collector.fail(tuple);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        String[] fields = getFields();
        if (fields != null) {
            declarer.declare(new Fields(fields));
        }
    }

    protected void emit(Tuple tuple, Values values) {
        collector.emit(tuple, values);
    }

    protected abstract void prepare() throws Exception;

    protected abstract void process(Tuple tuple) throws Exception;

    protected abstract String[] getFields();
}

So no tuple is emitted without a messageId (from the spout) or unanchored (from the bolt).

1 Answer:

Answer 0: (score: 0)

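The problem here is that the calls to Spout.nextTuple(), Spout.ack() and Spout.fail() all happen on the same thread. If you put a lot of tuples into the topology, the ack or fail messages end up waiting to be processed by the source spout, which lengthens the ack/fail time.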
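To make the single-thread effect concrete, here is a minimal sketch in plain Java. It is a deliberately simplified model, not Storm's actual executor code: the class SpoutLoopSimulation and the method ackHandledAt are hypothetical names, ack handling is assumed to take no time, and the clock is simulated. One thread alternates between checking for arrived acks and running nextTuple(), so an ack that arrives while nextTuple() is busy has to wait for it to return.

```java
// Simplified model (an assumption, not Storm's real executor loop):
// one thread alternates between handling arrived acks and calling nextTuple().
class SpoutLoopSimulation {

    /**
     * Returns the simulated time (ms) at which an ack arriving at
     * ackArrivesAtMs is finally handled, when every nextTuple() call
     * occupies the thread for nextTupleCostMs.
     */
    static long ackHandledAt(long ackArrivesAtMs, long nextTupleCostMs) {
        if (nextTupleCostMs <= 0) {
            throw new IllegalArgumentException("nextTupleCostMs must be positive");
        }
        long now = 0;
        while (true) {
            // Step 1: the thread checks whether the ack has already arrived.
            if (ackArrivesAtMs <= now) {
                return now;
            }
            // Step 2: the thread is busy inside nextTuple() and cannot handle acks.
            now += nextTupleCostMs;
        }
    }

    public static void main(String[] args) {
        long arrival = 1972; // bolt ack latency taken from the log in the question
        System.out.println("nextTuple() takes 1 ms  -> ack handled at "
                + ackHandledAt(arrival, 1) + " ms");
        System.out.println("nextTuple() takes 50 ms -> ack handled at "
                + ackHandledAt(arrival, 50) + " ms");
    }
}
```

In this model an ack arriving at 1972 ms is handled at 1972 ms when nextTuple() returns within 1 ms, and at 2000 ms when each call holds the thread for 50 ms. The sketch only illustrates the mechanism; it does not try to reproduce the 53 s figure from the log.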

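You also mentioned that "sleeping" had no effect. If you mean that you called Thread.sleep() inside the spout's nextTuple() method, that only makes things worse, because you are stalling the very thread that handles the acks/fails.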
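If rate limiting is still wanted, one option is to skip the emit and return from nextTuple() immediately instead of sleeping, which is roughly what "Attempt 4" in the question describes. The sketch below shows that idea as a plain Java helper. EmitThrottle and tryEmit are hypothetical names, not part of Storm, and the 50 ms interval is only an example value.

```java
// Hypothetical helper, not part of Storm: decides whether enough time has
// passed since the last emit, without ever blocking the calling thread.
class EmitThrottle {
    private final long minIntervalMs;
    private long lastEmitMs;
    private boolean emittedBefore = false;

    EmitThrottle(long minIntervalMs) {
        this.minIntervalMs = minIntervalMs;
    }

    /** Returns true (and records nowMs as the last emit time) if an emit is allowed. */
    boolean tryEmit(long nowMs) {
        if (emittedBefore && nowMs - lastEmitMs < minIntervalMs) {
            // Too soon: the caller should return from nextTuple() right away.
            return false;
        }
        lastEmitMs = nowMs;
        emittedBefore = true;
        return true;
    }

    public static void main(String[] args) {
        EmitThrottle throttle = new EmitThrottle(50); // example interval in ms
        long[] times = {0, 49, 50, 99, 100};
        for (long t : times) {
            System.out.println("t=" + t + " ms, emit allowed: " + throttle.tryEmit(t));
        }
    }
}
```

In a spout built on the BaseSpout class from the question, getTuple() would call something like throttle.tryEmit(System.currentTimeMillis()) and return without emitting when the result is false, so the thread stays free to process acks and fails.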