I'm trying to implement a lambda architecture with Spark, Kafka and HBase (though I'd happily change the stack if there's a simpler option). My current idea is a streaming job that consumes messages from Kafka, stores them in HBase, and runs computations over the data, while also computing the earliest timestamp of the data produced (that timestamp is the key of every message in Kafka). Once that timestamp is computed, the streaming job needs to check whether a batch job is already running and, if not, start a new one with that timestamp as its upper boundary. What I can't figure out is how the streaming job can control and launch the batch job. Any ideas?
Here is the code, with some comments on what's needed plus some background:
import java.util.HashMap;
import java.util.HashSet;
import java.util.regex.Pattern;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

final String lowBoundary = args[0];
final String highBoundary = args[1];

SparkConf conf = new SparkConf()
        .setMaster("spark://dissertation:7077")
        .setAppName("LambdaStream");
JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));
ssc.checkpoint("hdfs://localhost:9000/spark");

HashSet<String> topicSet = new HashSet<String>();
topicSet.add("generator");
HashMap<String, String> kafkaMap = new HashMap<>();
kafkaMap.put("metadata.broker.list", "localhost:9092");

JavaPairInputDStream<String, String> stream = KafkaUtils
        .createDirectStream(ssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaMap, topicSet);
// The data is generated from TPC-H, so every row (i.e. every Kafka message)
// has the row's timestamp as its key and all the attributes separated by '|'.
JavaPairDStream<String, String[]> tuples = stream
        .mapToPair(new PairFunction<Tuple2<String, String>, String, String[]>() {
            @Override
            public Tuple2<String, String[]> call(Tuple2<String, String> tuple) {
                // split() takes a regex, so the pipe has to be quoted
                String[] split = tuple._2.split(Pattern.quote("|"));
                return new Tuple2<String, String[]>(tuple._1, split);
            }
        });

// I'll omit my actual saving-to-HBase code as it's too long to be useful here;
// a rough sketch of it follows.
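// (Hypothetical sketch only, assuming an HBase 1.x client and Java 8 lambdas
// for brevity; the table name "lineitem" and column family "data" are
// placeholders, and the HBase client imports -- org.apache.hadoop.hbase.*,
// org.apache.hadoop.hbase.client.*, org.apache.hadoop.hbase.util.Bytes --
// are assumed.)
tuples.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
    // One connection per partition: HBase connections aren't serializable,
    // so they must be created on the executor, not shipped from the driver.
    try (Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = hbase.getTable(TableName.valueOf("lineitem"))) {
        while (partition.hasNext()) {
            Tuple2<String, String[]> t = partition.next();
            Put put = new Put(Bytes.toBytes(t._1)); // timestamp key as row key
            put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("row"),
                    Bytes.toBytes(String.join("|", t._2)));
            table.put(put);
        }
    }
}));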
// For every tuple whose timestamp falls inside (lowBoundary, highBoundary),
// count the space-separated words in attribute 15 (0-based), then sum the
// counts across the microbatch.
JavaDStream<Integer> filterAndCount = tuples.filter(new Function<Tuple2<String, String[]>, Boolean>() {
    @Override
    public Boolean call(Tuple2<String, String[]> tuple) {
        long timestamp = Long.parseLong(tuple._2[0]);
        return timestamp > Long.parseLong(lowBoundary)
                && timestamp < Long.parseLong(highBoundary);
    }
}).map(new Function<Tuple2<String, String[]>, Integer>() {
    @Override
    public Integer call(Tuple2<String, String[]> tuple) throws Exception {
        return tuple._2[15].split(" ").length;
    }
}).reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer a, Integer b) throws Exception {
        return a + b;
    }
});
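// NB: DStream transformations are lazy, so this count only runs once it's
// tied to an output operation (in the real job the omitted HBase write plays
// that role); print() is a handy stand-in while debugging:
filterAndCount.print();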
// The lowest timestamp is to be provided as the upper boundary for the batch
// job, and it will also be used as the key for the HBase table in which the
// result from this microbatch is stored.
JavaDStream<Long> lowestTimestamp = stream.map(new Function<Tuple2<String, String>, Long>() {
    @Override
    public Long call(Tuple2<String, String> tuple) throws Exception {
        return Long.parseLong(tuple._1);
    }
}).reduce(new Function2<Long, Long, Long>() {
    @Override
    public Long call(Long a, Long b) throws Exception {
        return Math.min(a, b);
    }
});
// After computing the smallest timestamp, I need to check whether the previous
// batch job has finished and, if so, start a new one whose scan is bounded
// above by that timestamp. Something along the lines of the sketch below.
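This is the part I'm unsure about. One idea I've been toying with (untested, purely a sketch) is to have the streaming driver launch the batch application via SparkLauncher, which has shipped with Spark since 1.4. The jar path, the LambdaBatch main class, and the isBatchRunning() bookkeeping below are hypothetical placeholders; the VoidFunction overload of foreachRDD assumes Spark 1.6+, and the imports org.apache.spark.launcher.SparkLauncher and org.apache.spark.api.java.JavaRDD are assumed:

lowestTimestamp.foreachRDD(new VoidFunction<JavaRDD<Long>>() {
    @Override
    public void call(JavaRDD<Long> rdd) throws Exception {
        if (rdd.isEmpty()) {
            return; // empty microbatch, nothing to hand off
        }
        long minTimestamp = rdd.first();
        // foreachRDD executes on the driver, so plain JVM state (e.g. an
        // AtomicBoolean) can track whether a batch job is already in flight.
        if (!isBatchRunning()) {
            Process batch = new SparkLauncher()
                    .setMaster("spark://dissertation:7077")
                    .setAppResource("/path/to/lambda-batch.jar") // placeholder
                    .setMainClass("LambdaBatch")                 // placeholder
                    .addAppArgs(String.valueOf(minTimestamp))    // upper boundary
                    .launch();
            // Watching batch.isAlive() (or batch.waitFor() on a separate
            // thread) would let me flip the flag back when the job exits.
        }
    }
});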
Edit: added the code as requested.