How to receive messages from multiple Kafka topics in parallel with the Spark Streaming DStream API

Time: 2018-07-24 08:03:05

Tags: apache-spark apache-kafka spark-streaming

I use a MySQL CDC (change data capture) system that captures new records inserted into MySQL tables and sends them to Kafka, one topic per table. In Spark Streaming I then want to receive messages from multiple Kafka topics in parallel with the Spark Streaming DStream API, so I can further process the changed data from those MySQL tables.

The CDC setup itself works fine: I tested the messages of all the tables with kafka-consume-topic.sh, and every table's messages can be received. In Spark Streaming, however, I only receive messages from one table. Yet if I create just a single topic/stream in the application and test the tables one by one, every table on its own can be read through Spark Streaming. I have searched for a long time through related questions, articles, and the examples in the Spark GitHub project, but unfortunately found no solution. There are examples that union non-direct (receiver-based) streams, but those Spark Streaming APIs are quite old, and I hesitate to adopt them for fear of having to reinvent a lot of wheels later.
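
As far as I know, the Kafka 0.10 direct API's Subscribe strategy accepts several topics at once, so in principle a single direct stream could cover all the tables. A minimal sketch of that variant (only a sketch, reusing topicsArr, kafkaParam and ssc exactly as they are defined in my code below):

// Sketch: one direct stream subscribed to all topics, tagging each record with its topic
val allTopicsStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topicsArr, kafkaParam))
allTopicsStream.map(s => (s.topic(), s.key(), s.value())).print()

I am not sure whether this actually consumes all the tables in parallel, which is essentially what I am asking.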

Below is my code:

package com.fm.data.fmtrade

import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession

object TestKafkaSparkStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf() //.setMaster("spark://192.168.177.120:7077")
      .setAppName("SparkStreamKaflaWordCount Demo")
      .set("spark.streaming.concurrentJobs", "8") // allow several streaming jobs (output operations) to run concurrently
    val ss = SparkSession
      .builder()
      .config(conf)
      .appName(args.mkString(" "))
      .getOrCreate()

    val topicsArr: Array[String] = Array(
      "betadbserver1.copytrading.t_trades",
      "betadbserver1.copytrading.t_users",
      "betadbserver1.account.s_follower",
      "betadbserver1.copytrading.t_followorder",
      "betadbserver1.copytrading.t_follow",
      "betadbserver1.copytrading.t_activefollow",
      "betadbserver1.account.users",
      "betadbserver1.account.user_accounts"
    )

    // random suffix => a fresh consumer group on every run, so offsets start from "earliest"
    val group = "con-consumer-group111" + (new util.Random).nextInt(10000)
    val kafkaParam = Map(
      "bootstrap.servers" -> "beta-hbase02:9092,beta-hbase03:9092,beta-hbase04:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> group,
      "auto.offset.reset" -> "earliest",//"latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val ssc = new StreamingContext(ss.sparkContext, Seconds(4))
    // One direct stream per topic, each with its own print() output operation
    //val streams =
    topicsArr.foreach { //.slice(0,1)
      topic =>
        val newTopicsArr = Array(topic)
        val stream =
          KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](newTopicsArr, kafkaParam))
        stream.map(s => (s.key(), s.value())).print()
    }
    /*
    val unifiedStream = ssc.union(streams)
    unifiedStream.repartition(2)
    unifiedStream.map(s =>(s.key(),s.value())).print()
    */
    /*
    unifiedStream.foreachRDD{ rdd =>
      rdd.foreachPartition{ partitionOfRecords =>
        partitionOfRecords.foreach{ record =>

        }
      }
    }
    */
    ssc.start()
    ssc.awaitTermination()
  }

}
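
For completeness, the union variant in the commented-out part above would first have to collect the per-topic streams (map instead of foreach). A rough sketch of how I understand it would look (with extra imports; I have not verified that it fixes the single-topic problem):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

// Sketch: build one direct stream per topic, then union them into a single DStream
val streams: Seq[DStream[ConsumerRecord[String, String]]] =
  topicsArr.toSeq.map { topic =>
    KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array(topic), kafkaParam))
  }
val unifiedStream = ssc.union(streams)
unifiedStream.map(s => (s.key(), s.value())).print()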

0 Answers:

No answers yet