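I was previously using Storm and needed more batching capability, so I searched for batching in Storm and found Trident, which does micro-batching in real time.
But I can't figure out how Trident handles micro-batching (flow, batch size, batch interval) well enough to tell whether it really does what I need.
What I want to do is collect/save the tuples a spout emits during one interval and re-emit them to the downstream component/bolt/function at a different interval. (For example, the spout emits one tuple per second, and the next Trident function collects/saves those tuples and emits 50 tuples per minute to the next function.)
Can someone show me how to apply Trident in this case? Or is there another way to do this with Storm's own features?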
Answer 0 (score: 2)
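Great question! Unfortunately, Trident does not support this kind of micro-batching out of the box.
But you can implement your own frequency-driven micro-batching in a plain bolt driven by tick tuples. Something like this skeleton example: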
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MicroBatchingBolt extends BaseRichBolt {

    private static final long serialVersionUID = 8500984730263268589L;

    private static final Logger LOG = LoggerFactory.getLogger(MicroBatchingBolt.class);

    /** Tuples received but not yet processed (and not yet acked). */
    protected LinkedBlockingQueue<Tuple> queue = new LinkedBlockingQueue<Tuple>();

    /** The threshold after which the batch should be flushed out. */
    int batchSize = 100;

    /**
     * The batch interval in seconds. Minimum time between flushes if the batch
     * size is not reached. This should typically be equal to
     * topology.tick.tuple.freq.secs and half of topology.message.timeout.secs.
     */
    int batchIntervalInSec = 45;

    /** Time of the last batch flush, in epoch seconds. Used for tracking. */
    long lastBatchProcessTimeSeconds = 0;

    private OutputCollector collector;

    @Override
    @SuppressWarnings("rawtypes")
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // Check if the tuple is a tick tuple
        if (isTickTuple(tuple)) {
            // If so, it is a signal to flush the batch. But don't flush if the
            // previous flush happened very recently (either because the batch
            // size threshold was crossed or because of another tick tuple).
            if ((System.currentTimeMillis() / 1000 - lastBatchProcessTimeSeconds) >= batchIntervalInSec) {
                LOG.debug("Current queue size is " + this.queue.size()
                        + ". But received tick tuple so executing the batch");
                finishBatch();
            } else {
                LOG.debug("Current queue size is " + this.queue.size()
                        + ". Received tick tuple but last batch was executed "
                        + (System.currentTimeMillis() / 1000 - lastBatchProcessTimeSeconds)
                        + " seconds back that is less than " + batchIntervalInSec
                        + " so ignoring the tick tuple");
            }
        } else {
            // Add the tuple to the queue. But don't ack it yet.
            this.queue.add(tuple);
            int queueSize = this.queue.size();
            LOG.debug("Current queue size is " + queueSize);
            if (queueSize >= batchSize) {
                LOG.debug("Current queue size is >= " + batchSize
                        + " executing the batch");
                finishBatch();
            }
        }
    }

    private boolean isTickTuple(Tuple tuple) {
        // A tick tuple comes from the system component on the tick stream;
        // TupleUtils.isTick performs exactly that check.
        return TupleUtils.isTick(tuple);
    }

    /**
     * Drains the queue and processes the drained tuples as one batch.
     */
    public void finishBatch() {
        LOG.debug("Finishing batch of size " + queue.size());
        lastBatchProcessTimeSeconds = System.currentTimeMillis() / 1000;
        List<Tuple> tuples = new ArrayList<Tuple>();
        queue.drainTo(tuples);
        for (Tuple tuple : tuples) {
            // Prepare your batch here (be it JDBC, HBase, ElasticSearch, Solr or
            // anything else).
            // List<Response> responses = externalApi.get("...");
        }
        try {
            // Execute your batch here and ack or fail the tuples
            LOG.debug("Executed the batch. Processing responses.");
            // for (int counter = 0; counter < responses.size(); counter++) {
            //     if (responses.get(counter).isFailed()) {
            //         LOG.error("Failed to process tuple # " + counter);
            //         this.collector.fail(tuples.get(counter));
            //     } else {
            //         LOG.debug("Successfully processed tuple # " + counter);
            //         this.collector.ack(tuples.get(counter));
            //     }
            // }
        } catch (Exception e) {
            LOG.error("Unable to process " + tuples.size() + " tuples", e);
            // Fail entire batch
            for (Tuple tuple : tuples) {
                this.collector.fail(tuple);
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // ...
    }
}
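The flush rule used by the bolt (flush as soon as the buffer reaches the batch size, or when a tick arrives and at least the batch interval has passed since the last flush) can be tried out without a running topology. Below is a minimal sketch of just that rule in plain Java. The class `MicroBatcher` and its methods `add`, `onTick`, and `pending` are hypothetical names made up for illustration, not part of Storm or Trident, and the current time is passed in as a parameter so the behaviour is easy to check.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical, Storm-free model of the bolt's flush rule (illustration only).
 * Times are plain seconds supplied by the caller instead of the system clock.
 */
class MicroBatcher<T> {
    private final int batchSize;
    private final long batchIntervalSec;
    private final List<T> buffer = new ArrayList<T>();
    private long lastFlushSec = 0;

    MicroBatcher(int batchSize, long batchIntervalSec) {
        this.batchSize = batchSize;
        this.batchIntervalSec = batchIntervalSec;
    }

    /** Buffers an item; returns the whole batch once the size threshold is reached, else an empty list. */
    List<T> add(T item, long nowSec) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            return flush(nowSec);
        }
        return new ArrayList<T>();
    }

    /** Handles a tick: flushes only if at least batchIntervalSec has passed since the last flush. */
    List<T> onTick(long nowSec) {
        if (nowSec - lastFlushSec >= batchIntervalSec) {
            return flush(nowSec);
        }
        return new ArrayList<T>();
    }

    /** Number of items currently buffered. */
    int pending() {
        return buffer.size();
    }

    /** Records the flush time and hands back a copy of the buffer, then clears it. */
    private List<T> flush(long nowSec) {
        lastFlushSec = nowSec;
        List<T> batch = new ArrayList<T>(buffer);
        buffer.clear();
        return batch;
    }
}
```

With `new MicroBatcher<Integer>(50, 60)` and one item added per second, the 50th `add` returns all 50 items at once, which is roughly the "collect one per second, release 50 at a time" behaviour asked about. In the bolt itself, `add` corresponds to the non-tick branch of `execute`, `onTick` to the tick branch, and the returned batch is what `finishBatch()` would process or emit downstream. Note that the bolt only receives tick tuples if `topology.tick.tuple.freq.secs` is set, either for the whole topology or for this component.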
Sources: http://hortonworks.com/blog/apache-storm-design-pattern-micro-batching/ and "Using tick tuples with trident in storm"