I use a MySQL CDC (change data capture) system that captures new records inserted into MySQL tables and sends them to Kafka, one topic per table. In Spark Streaming, I then want to consume messages from several of these Kafka topics in parallel with the Spark Streaming DStream API, so that I can further process the changed data from these MySQL tables.
The CDC setup itself works fine: testing with kafka-consume-topic.sh, I can receive the messages for all of the tables. In Spark Streaming, however, I only receive messages from one table. Yet if I create just a single topic/stream in the application and test the tables one by one, every table can be read through Spark Streaming on its own. I have searched for a long time through related issues, articles, and examples in the Spark GitHub project, and unfortunately found no solution. There are some examples that union non-direct streams, but those use the old Spark Streaming receiver API, and I hesitate to adopt them, suspecting that it might mean a lot of wheel-reinventing later.
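For reference, my understanding of what that union approach would look like with the current kafka010 direct API (instead of the old receiver-based API) is sketched below; it reuses the ssc, topicsArr, and kafkaParam from my code further down, and I am not sure whether this is the intended way to consume several tables in parallel:

// Sketch only: one kafka010 direct stream per topic, merged with StreamingContext.union.
// ssc, topicsArr, and kafkaParam are the same values as in the full code below.
val perTopicStreams = topicsArr.toSeq.map { topic =>
  KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Array(topic), kafkaParam))
}
val unifiedStream = ssc.union(perTopicStreams) // single DStream covering all tables
unifiedStream.map(r => (r.key(), r.value())).print()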
Below is my code:
package com.fm.data.fmtrade
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
object TestKafkaSparkStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf() //.setMaster("spark://192.168.177.120:7077")
      .setAppName("SparkStreamKaflaWordCount Demo")
      .set("spark.streaming.concurrentJobs", "8")
    val ss = SparkSession
      .builder()
      .config(conf)
      .appName(args.mkString(" "))
      .getOrCreate()

    // One Kafka topic per MySQL table, produced by the CDC system.
    val topicsArr: Array[String] = Array(
      "betadbserver1.copytrading.t_trades",
      "betadbserver1.copytrading.t_users",
      "betadbserver1.account.s_follower",
      "betadbserver1.copytrading.t_followorder",
      "betadbserver1.copytrading.t_follow",
      "betadbserver1.copytrading.t_activefollow",
      "betadbserver1.account.users",
      "betadbserver1.account.user_accounts"
    )

    // Random suffix so every run starts with a fresh consumer group.
    val group = "con-consumer-group111" + (new util.Random).nextInt(10000)
    val kafkaParam = Map(
      "bootstrap.servers" -> "beta-hbase02:9092,beta-hbase03:9092,beta-hbase04:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> group,
      "auto.offset.reset" -> "earliest", //"latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val ssc = new StreamingContext(ss.sparkContext, Seconds(4))

    // One direct stream per topic; in practice only one table's messages ever arrive.
    //val streams =
    topicsArr.foreach { //.slice(0,1)
      topic =>
        val newTopicsArr = Array(topic)
        val stream =
          KafkaUtils.createDirectStream[String, String](
            ssc, PreferConsistent, Subscribe[String, String](newTopicsArr, kafkaParam))
        stream.map(s => (s.key(), s.value())).print()
    }

    /*
    val unifiedStream = ssc.union(streams)
    unifiedStream.repartition(2)
    unifiedStream.map(s => (s.key(), s.value())).print()
    */

    /*
    unifiedStream.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        partitionOfRecords.foreach { record =>
        }
      }
    }
    */

    ssc.start()
    ssc.awaitTermination()
  }
}