Making Spark Streaming start a Spark batch job

Date: 2016-03-17 20:34:04

Tags: java apache-spark hbase apache-kafka spark-streaming

I'm trying to implement a lambda architecture with Spark, Kafka, and HBase, though I'd happily switch to a simpler option if one exists. My current idea is a streaming job that consumes records from Kafka, stores them in HBase, and runs computations over the data, while also computing the earliest timestamp among the data produced (that timestamp is the key of every Kafka message). Once the timestamp is computed, the job needs to check whether a batch job is already running and, if not, start a new one with that timestamp as its upper boundary. I'm not sure how the streaming job can control and launch a batch job. Any ideas?
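
One way to do this, since the driver side of a Spark Streaming job is a plain JVM program, is to launch the batch application as a child Spark application with `org.apache.spark.launcher.SparkLauncher` (available since Spark 1.4). A minimal sketch, assuming a hypothetical batch jar at `/opt/jobs/lambda-batch.jar` with main class `lambda.BatchJob`:

    // Launch a separate Spark batch application from driver-side Java code.
    // The jar path, main class, and boundary argument are illustrative placeholders.
    Process batch = new SparkLauncher()
            .setMaster("spark://dissertation:7077")       // same cluster as the streaming job
            .setAppResource("/opt/jobs/lambda-batch.jar") // hypothetical batch-job jar
            .setMainClass("lambda.BatchJob")              // hypothetical main class
            .addAppArgs(String.valueOf(upperBoundary))    // upper timestamp boundary (hypothetical variable)
            .launch();                                    // returns a java.lang.Process

    int exitCode = batch.waitFor(); // block until the batch application finishes

Because `launch()` returns a plain `java.lang.Process`, ordinary process management (a watcher thread, `waitFor()`, exit codes) can be used to tell whether a batch run is still in flight.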

Here is a snippet of the code, with some comments on what is needed and some background:

    final String lowBoundary = args[0];
    final String highBoundary = args[1];

    SparkConf conf = new SparkConf()
            .setMaster("spark://dissertation:7077")
            .setAppName("LambdaStream");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));

    ssc.checkpoint("hdfs://localhost:9000/spark");

    HashSet<String> topicSet = new HashSet<String>();
    topicSet.add("generator");

    HashMap<String, String> kafkaMap = new HashMap<>();
    kafkaMap.put("metadata.broker.list", "localhost:9092");

    JavaPairInputDStream<String, String> stream = KafkaUtils
            .createDirectStream(ssc, String.class, String.class,
                    StringDecoder.class, StringDecoder.class, kafkaMap, topicSet);

    // The data is generated from TPC-H, so every row, i.e. every Kafka message,
    // has the row's timestamp as its key and all the attributes separated by |
    JavaPairDStream<String, String[]> tuples = stream
            .mapToPair(new PairFunction<Tuple2<String, String>, String, String[]>() {

                @Override
                public Tuple2<String, String[]> call(Tuple2<String, String> tuple) {

                    String[] split = tuple._2.split(Pattern.quote("|"));

                    return new Tuple2<String, String[]>(tuple._1, split);
                }
            });

    // I'll omit the code that saves to HBase, as it's long and not relevant here

    JavaDStream<Integer> filterAndCount = tuples.filter(new Function<Tuple2<String, String[]>, Boolean>() {

        @Override
        public Boolean call(Tuple2<String, String[]> tuple) {
            // Keep only rows whose timestamp falls strictly between the boundaries
            long timestamp = Long.parseLong(tuple._2[0]);
            return timestamp > Long.parseLong(lowBoundary)
                    && timestamp < Long.parseLong(highBoundary);
        }
    }).map(new Function<Tuple2<String, String[]>, Integer>() {

        @Override
        public Integer call(Tuple2<String, String[]> tuple) throws Exception {
            // Count the whitespace-separated words in the row's 16th attribute
            return tuple._2[15].split(" ").length;
        }
    }).reduce(new Function2<Integer, Integer, Integer>() {

        @Override
        public Integer call(Integer a, Integer b) throws Exception {
            return a + b;
        }
    });

    // The lowest timestamp is to be provided as the upper boundary for the batch job and is
    // also used as the key for the HBase row in which the result of this micro-batch is stored
    JavaDStream<Long> lowestTimestamp = stream.map(new Function<Tuple2<String, String>, Long>() {

        @Override
        public Long call(Tuple2<String, String> tuple) throws Exception {
            return Long.parseLong(tuple._1);
        }

    }).reduce(new Function2<Long, Long, Long>() {

        @Override
        public Long call(Long a, Long b) throws Exception {
            return Math.min(a, b); // keep the smaller timestamp
        }

    });
    // After computing the smallest timestamp I need to check whether a batch job is already
    // running and, if not, start one with its scan bounded by this minimum timestamp.
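
A sketch of how that last step could be wired on top of the code above: `foreachRDD` runs its function on the driver, so the driver can keep an `AtomicBoolean` flag recording whether a batch run is in flight and start one via `SparkLauncher` when it is not. The jar path and main class are hypothetical placeholders; the required imports are `org.apache.spark.launcher.SparkLauncher` and `java.util.concurrent.atomic.AtomicBoolean`.

    final AtomicBoolean batchRunning = new AtomicBoolean(false);

    lowestTimestamp.foreachRDD(new Function<JavaRDD<Long>, Void>() {

        @Override
        public Void call(JavaRDD<Long> rdd) throws Exception {
            if (rdd.isEmpty()) {
                return null; // empty micro-batch, nothing to launch
            }
            // The stream was already reduced, so the RDD holds a single value
            final long minTimestamp = rdd.first();

            // Only start a new batch run if the previous one has finished
            if (batchRunning.compareAndSet(false, true)) {
                final Process batch = new SparkLauncher()
                        .setMaster("spark://dissertation:7077")
                        .setAppResource("/opt/jobs/lambda-batch.jar") // hypothetical jar
                        .setMainClass("lambda.BatchJob")              // hypothetical class
                        .addAppArgs(String.valueOf(minTimestamp))     // upper scan boundary
                        .launch();

                // Wait for completion on a side thread so the micro-batch isn't blocked
                new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            batch.waitFor();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } finally {
                            batchRunning.set(false);
                        }
                    }
                }).start();
            }
            return null;
        }
    });

On Spark 1.6+ the same can be done with `SparkLauncher.startApplication()`, which returns a `SparkAppHandle` whose `getState()` can be polled instead of tracking the flag manually.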

EDIT: Added the code as requested

0 Answers:

No answers yet.