The Spark consumer has to read topics with the same name from two different Kafka bootstrap servers. So I need to create two JavaDStreams, union them, process the stream, and then commit the offsets.
JavaInputDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);
The problem is that JavaInputDStream does not support dStream.union(stream2);
If I instead use
JavaDStream<ConsumerRecord<String, GenericRecord>> dStream = KafkaUtils.createDirectStream(...);
then JavaDStream does not support
((CanCommitOffsets) dStream.inputDStream()).commitAsync(os);
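A mapped JavaDStream does support union, so one workaround is to union only the value streams while keeping both original input streams around for offset access. A minimal sketch, assuming a second stream `stream2` built the same way as `dStream` but pointed at the other bootstrap server:

```java
// Keep both JavaInputDStreams for offset commits; union only the values.
JavaDStream<GenericRecord> values1 = dStream.map(record -> record.value());
JavaDStream<GenericRecord> values2 = stream2.map(record -> record.value());
JavaDStream<GenericRecord> unionedValues = values1.union(values2);
```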
Please answer as soon as possible.

Answer 0 (score: 0)
There is no direct way to do this. My idea is to first convert the DStreams into Datasets/DataFrames and then perform a UNION on the two DataFrames/Datasets.
The code below is untested, but it should work. Please verify it and make whatever changes are necessary to get it working.
//The two parameter maps differ in "bootstrap.servers", one per cluster
JavaPairInputDStream<String, String> pairDstream1 = KafkaUtils.createDirectStream(ssc, kafkaParams1, topics);
JavaPairInputDStream<String, String> pairDstream2 = KafkaUtils.createDirectStream(ssc, kafkaParams2, topics);

//Convert each pair stream to a JavaDStream<String> of message values
JavaDStream<String> dstream1 = pairDstream1.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});
JavaDStream<String> dstream2 = pairDstream2.map(new Function<Tuple2<String, String>, String>() {
    @Override
    public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
    }
});

//Create Schema for the single-column DataFrame
final StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("Message", DataTypes.StringType, true)});

//Pair the two streams batch by batch, build a DataFrame from each RDD,
//and union the two DataFrames
JavaDStream<Row> unioned = dstream1.transformWith(dstream2,
    new Function3<JavaRDD<String>, JavaRDD<String>, Time, JavaRDD<Row>>() {
        @Override
        public JavaRDD<Row> call(JavaRDD<String> rdd1, JavaRDD<String> rdd2, Time time) {
            //Create JavaRDD<Row> from each side
            Function<String, Row> toRow = new Function<String, Row>() {
                @Override
                public Row call(String msg) {
                    return RowFactory.create(msg);
                }
            };
            JavaRDD<Row> rowRDD1 = rdd1.map(toRow);
            JavaRDD<Row> rowRDD2 = rdd2.map(toRow);
            //Get Spark 2.0 session
            SparkSession spark = JavaSparkSessionSingleton.getInstance(rdd1.context().getConf());
            Dataset<Row> df1 = spark.createDataFrame(rowRDD1, schema);
            Dataset<Row> df2 = spark.createDataFrame(rowRDD2, schema);
            //Union both DataFrames and hand the result back as an RDD of rows
            return df1.union(df2).toJavaRDD();
        }
    });
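Note that this approach drops the offset handling from the question. If offsets still need to be committed, the usual pattern with the spark-streaming-kafka-0-10 API is to commit on each original JavaInputDStream separately. An untested sketch, using the question's `dStream` (repeat the same block for the second input stream):

```java
//Grab this batch's offset ranges from the stream's own RDD, then commit
//them back on the same input stream once the batch has been processed.
dStream.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<String, GenericRecord>>>() {
    @Override
    public void call(JavaRDD<ConsumerRecord<String, GenericRecord>> rdd) {
        OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        // ... process this batch ...
        ((CanCommitOffsets) dStream.inputDStream()).commitAsync(offsetRanges);
    }
});
```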