I wrote some code to pull records from Kafka into Spark, and I am seeing some strange behaviour: it consumes records at an inconsistent rate.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

val conf = new SparkConf()
  .setAppName("Test Data")
  .set("spark.cassandra.connection.host", "192.168.0.40")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "2g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "3")
  .set("spark.cores.max", "12")
  .set("spark.driver.cores", "4")
  .set("spark.ui.port", "4040")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "30")
  .set("spark.local.dir", "//tmp//")
  .set("spark.sql.warehouse.dir", "/tmp/hive/")
  .set("hive.exec.scratchdir", "/tmp/hive2")
val spark = SparkSession
  .builder
  .appName("Test Data")
  .config(conf)
  .getOrCreate()
import spark.implicits._

val sc = SparkContext.getOrCreate(conf)
val ssc = new StreamingContext(sc, Seconds(10))

// Receiver-based Kafka stream (old 0.8 high-level consumer API);
// the 1 is the number of consumer threads for the topic.
val topics = Map("topictest" -> 1)
val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "192.168.0.40:2181",
  "group.id" -> "=groups",
  "auto.offset.reset" -> "smallest")
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)
kafkaStream.foreachRDD { rdd =>
  if (!rdd.partitions.isEmpty) {
    try {
      println("Count of rows " + rdd.count())
    } catch {
      case e: Exception => e.printStackTrace()
    }
  } else {
    println("blank rdd")
  }
}

ssc.start()
ssc.awaitTermination()
So, initially I produced 10 million records into Kafka. Then the producer was stopped and the Spark consumer application was started. Checking the Spark UI, at first I received 700,000-900,000 records per batch (every 10 seconds), but after that it dropped to only 4,000-6,000 records per batch. I want to understand why the fetch count falls so badly even though the data is still sitting in Kafka, and how to make the consumer keep pulling large batches instead of roughly 4K records per batch. What can be done, and how?
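In case it matters: the config above enables backpressure and sets spark.streaming.kafka.maxRatePerPartition to 30, but I am using the receiver-based createStream. If I understand the docs correctly, maxRatePerPartition only applies to the direct (receiver-less) API, where it would cap each batch at roughly maxRatePerPartition x batch-interval x number-of-partitions records. A rough sketch of what I think the direct version would look like is below; the broker host:port (192.168.0.40:9092) is a placeholder, not taken from my setup:

// Sketch only: direct (receiver-less) Kafka stream with the 0.8 API.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val directKafkaParams = Map[String, String](
  "metadata.broker.list" -> "192.168.0.40:9092",  // placeholder broker host:port
  "group.id" -> "groups",
  "auto.offset.reset" -> "smallest")

// One RDD partition per Kafka partition; rate cap per partition comes from
// spark.streaming.kafka.maxRatePerPartition when backpressure is enabled.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, directKafkaParams, Set("topictest"))

directStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) println("Count of rows " + rdd.count())
}

Even then, I am not sure whether the sharp drop I am seeing comes from backpressure throttling the receiver or from something else, which is what I am trying to understand.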
Thanks,