Spark Streaming join between Kafka topics

Date: 2019-05-24 14:51:35

Tags: scala apache-spark apache-kafka spark-streaming

We have two InputDStreams coming from two Kafka topics, and we need to join the data of these two inputs. The problem is that each InputDStream is processed independently inside its own foreachRDD, so nothing can be returned from it to perform the join afterwards.

  import scala.collection.mutable.ListBuffer

  var Message1ListBuffer = new ListBuffer[Message1]
  var Message2ListBuffer = new ListBuffer[Message2]

    inputDStream1.foreachRDD(rdd => {
      if (!rdd.partitions.isEmpty) {
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.map({ msg =>
          val r = msg.value()
          val avro = AvroUtils.objectToAvro(r.getSchema, r)
          val messageValue = AvroInputStream.json[FMessage1](avro.getBytes("UTF-8")).singleEntity.get
          Message1ListBuffer = Message1FlatMapper.flatmap(messageValue)
          Message1ListBuffer
        })
        inputDStream1.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    })


    inputDStream2.foreachRDD(rdd => {
      if (!rdd.partitions.isEmpty) {
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.map({ msg =>
          val r = msg.value()
          val avro = AvroUtils.objectToAvro(r.getSchema, r)
          val messageValue = AvroInputStream.json[FMessage2](avro.getBytes("UTF-8")).singleEntity.get
          Message2ListBuffer = Message2FlatMapper.flatmap(messageValue)
          Message2ListBuffer
        })
        inputDStream2.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    })

I thought I could return Message1ListBuffer and Message2ListBuffer, convert them to DataFrames, and join them. But that does not work, and I do not think it is the best option anyway.

From there, what is the way to return the RDD of each foreachRDD so that the join can be performed?

inputDStream1.foreachRDD(rdd => {

})


inputDStream2.foreachRDD(rdd => {

})

1 Answer:

Answer 0 (score: 1)

Not sure which Spark version you are using; with Spark 2.3+ this can be achieved directly.

With Spark >= 2.3

Subscribe to the 2 topics you want to join

val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic1")
  .option("startingOffsets", "earliest")
  .load

val ds2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerhost1:port1,brokerhost2:port2")
  .option("subscribe", "source-topic2")
  .option("startingOffsets", "earliest")
  .load

Format the subscribed messages in both streams

val stream1 = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

val stream2 = ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

Join both streams

val resultStream = stream1.join(stream2, "key")  // equi-join on the Kafka message key

More join operations are described in the Structured Streaming programming guide.
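To make the join shape concrete, here is a plain-Scala model of what the inner equi-join above produces on one micro-batch of each stream. This is only an illustrative sketch with no Spark dependency; the `(key, value)` pairs mirror the `CAST(key/value AS STRING)` shape of the two streams.

```scala
// Plain-Scala sketch of inner equi-join semantics, as performed by
// stream1.join(stream2, "key"). All names here are illustrative only.
object JoinModel {
  def innerJoin(left: Seq[(String, String)],
                right: Seq[(String, String)]): Seq[(String, (String, String))] = {
    // Index the right side by key, then pair every left record
    // with every right record sharing its key.
    val rightByKey = right.groupBy(_._1)
    for {
      (k, lv) <- left
      (_, rv) <- rightByKey.getOrElse(k, Seq.empty)
    } yield (k, (lv, rv))
  }
}
```

Keys present on only one side (such as `"b"` and `"c"` below) simply produce no output row, which is exactly why late-arriving records miss the join if they are no longer buffered.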

  

Caveat:

Late records will not get a join match; the buffering needs to be tuned a bit. More information can be found here.
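In Structured Streaming this buffering is tuned with event-time watermarks (`withWatermark`) and a time-bounded join condition. The snippet below is a plain-Scala model of the watermark rule itself, not Spark's API; the `Event` type and field names are hypothetical. The watermark trails the maximum event time seen by a fixed delay, and records older than the watermark are dropped instead of kept for matching.

```scala
// Hypothetical plain-Scala model of event-time watermarking (not Spark's API):
// the watermark = max event time seen - allowed delay; anything older is
// evicted from the join buffer and can no longer get a match.
final case class Event(key: String, eventTimeMs: Long)

object WatermarkModel {
  def survivors(events: Seq[Event], delayMs: Long): Seq[Event] = {
    val watermark = events.map(_.eventTimeMs).max - delayMs
    events.filter(_.eventTimeMs >= watermark)
  }
}
```

A larger delay keeps more late records joinable at the cost of more buffered state, which is the trade-off the answer's caveat is pointing at.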