Spark Streaming - Is there a way to union two JavaInputDStreams, perform a transformation on the unified stream, and commit offsets?

Time: 2018-04-03 19:11:54

Tags: java apache-spark apache-kafka spark-streaming

A Spark consumer has to read topics with the same name from two different bootstrap servers. So I need to create two JavaDStreams, perform a union, process the unified stream, and commit the offsets.

JavaInputDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);

The problem is that JavaInputDStream does not support dStream.union(stream2);

If I use instead:

JavaDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);

then JavaDStream does not support:

((CanCommitOffsets) dStream.inputDStream()).commitAsync(os);

1 Answer:

Answer 0 (score: 0)

A quick answer:

I don't see a direct way to do this; my idea is to first convert the DStreams to Datasets/DataFrames and then perform a UNION on the two DataFrames.

The following code is untested, but it should work. Please verify it and make whatever changes are needed to get it running.

JavaPairInputDStream<String, String> pairDstream1 = KafkaUtils.createDirectStream(ssc, kafkaParams1, topics); //params pointing at the first bootstrap server
JavaPairInputDStream<String, String> pairDstream2 = KafkaUtils.createDirectStream(ssc, kafkaParams2, topics); //params pointing at the second bootstrap server

//Create JavaDStream<String>
JavaDStream<String> dstream1 = pairDstream1.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
      return tuple2._2();
    }
  });

//Create JavaDStream<String>
JavaDStream<String> dstream2 = pairDstream2.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
      return tuple2._2();
    }
  });

Nesting a foreachRDD for the second stream inside the first stream's foreachRDD does not work, because DStream output operations must be registered before the streaming context starts. Instead, transformWith pairs each batch of the two streams, so both RDDs are available in one function, where each can be converted to a DataFrame and the two DataFrames unioned:

//Pair the batches of both streams, build a DataFrame from each, and union them
dstream1.transformWith(dstream2,
    (Function3<JavaRDD<String>, JavaRDD<String>, Time, JavaRDD<Row>>) (rdd1, rdd2, time) -> {
      //Create Schema
      StructType schema = DataTypes.createStructType(new StructField[] {
          DataTypes.createStructField("Message", DataTypes.StringType, true)});
      //Get Spark 2.0 session
      SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd1.context().getConf());

      //Create a JavaRDD<Row> per stream
      JavaRDD<Row> rowRDD1 = rdd1.map(msg -> RowFactory.create(msg));
      JavaRDD<Row> rowRDD2 = rdd2.map(msg -> RowFactory.create(msg));

      Dataset<Row> df1 = spark.createDataFrame(rowRDD1, schema);
      Dataset<Row> df2 = spark.createDataFrame(rowRDD2, schema);

      //Union the two DataFrames and return the result
      return df1.union(df2).javaRDD();
    }).print();

//Start the streaming job
ssc.start();
ssc.awaitTermination();
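
Worth adding: once the data is converted to DataFrames, the result of the union no longer carries Kafka offset ranges, so the commitAsync part of the question still has to be done against the original input streams. Below is an untested sketch of the pattern from the Spark Streaming + Kafka 0.10 integration guide, adapted to two streams; `stream1` and `stream2` are assumed to be the two `JavaInputDStream`s created with `KafkaUtils.createDirectStream(...)` as in the question.

```java
import java.util.concurrent.atomic.AtomicReference;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

// Capture each stream's offset ranges in a transform BEFORE the union,
// because the unioned RDD can no longer be cast to HasOffsetRanges.
AtomicReference<OffsetRange[]> offsets1 = new AtomicReference<>();
AtomicReference<OffsetRange[]> offsets2 = new AtomicReference<>();

JavaDStream<ConsumerRecord<String, GenericRecord>> s1 =
    stream1.transform(rdd -> {
      offsets1.set(((HasOffsetRanges) rdd.rdd()).offsetRanges());
      return rdd;
    });
JavaDStream<ConsumerRecord<String, GenericRecord>> s2 =
    stream2.transform(rdd -> {
      offsets2.set(((HasOffsetRanges) rdd.rdd()).offsetRanges());
      return rdd;
    });

// Union works at the JavaDStream level.
JavaDStream<ConsumerRecord<String, GenericRecord>> unioned = s1.union(s2);

unioned.foreachRDD(rdd -> {
  // ... process the unified batch here ...

  // Commit against the original input streams, not the union.
  ((CanCommitOffsets) stream1.inputDStream()).commitAsync(offsets1.get());
  ((CanCommitOffsets) stream2.inputDStream()).commitAsync(offsets2.get());
});
```

The key point is that `HasOffsetRanges` can only be cast from the RDDs produced directly by the direct stream, which is why the offset ranges are grabbed in per-stream transforms before any union or conversion.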