Inconsistent fetching in Spark Kafka consumer

Date: 2018-05-26 09:18:00

Tags: scala apache-spark apache-kafka spark-streaming kafka-consumer-api

I have written some code to fetch records from Kafka into Spark, and I am seeing some strange behaviour: the records are consumed in an inconsistent manner.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("Test Data")
  .set("spark.cassandra.connection.host", "192.168.0.40")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "2g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "3")
  .set("spark.cores.max", "12")
  .set("spark.driver.cores", "4")
  .set("spark.ui.port", "4040")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.kafka.maxRatePerPartition", "30")
  .set("spark.local.dir", "/tmp/")
  .set("spark.sql.warehouse.dir", "/tmp/hive/")
  .set("hive.exec.scratchdir", "/tmp/hive2")

val spark = SparkSession
  .builder
  .appName("Test Data")
  .config(conf)
  .getOrCreate()

import spark.implicits._

// Reuse the SparkContext created by the SparkSession above
val sc = spark.sparkContext
val ssc = new StreamingContext(sc, Seconds(10))

// For the receiver-based API the value is the number of consumer threads, not partitions
val topics = Map("topictest" -> 1)

val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "192.168.0.40:2181",
  "group.id" -> "=groups",
  "auto.offset.reset" -> "smallest")

// Receiver-based (ZooKeeper) Kafka stream
val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)

kafkaStream.foreachRDD { rdd =>
  if (!rdd.partitions.isEmpty) {
    try {
      println("Count of rows " + rdd.count())
    } catch {
      case e: Exception => e.printStackTrace()
    }
  } else {
    println("blank rdd")
  }
}

// The streaming context has to be started for the job to run
ssc.start()
ssc.awaitTermination()
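
For comparison, this is how I would sketch the same consumer with the direct (receiver-less) API instead of createStream. It is only a sketch that I have not run: the broker address 192.168.0.40:9092 is an assumption (only the ZooKeeper address appears above), and it reuses the ssc and imports from the code above.

// Sketch only, not verified: direct Kafka stream (spark-streaming-kafka-0-8).
// The broker host:port below is an assumption; only ZooKeeper is shown above.
val directKafkaParams = Map[String, String](
  "metadata.broker.list" -> "192.168.0.40:9092",  // assumed broker address
  "auto.offset.reset" -> "smallest")

val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, directKafkaParams, Set("topictest"))

// Same counting logic as above; would have to be registered before ssc.start()
directStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) println("Count of rows " + rdd.count())
  else println("blank rdd")
}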

So, initially I produced 10 million records into Kafka. The producer was then stopped and the Spark consumer application was started. Checking the Spark UI, the first batches fetched 700,000-900,000 records each (the batch interval is 10 seconds), but after that it dropped to only 4,000-6,000 records per batch. I would like to understand why the fetch count falls off so badly even though the data is still present in Kafka, and what I can do so that the consumer keeps pulling large batches instead of roughly 4K records per batch.
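
In case it is relevant, the only rate-related settings I know of for the receiver-based stream are the two below. The values are guesses that I would try on the SparkConf above; I have not confirmed that they explain or fix the drop:

// Candidate settings for the SparkConf above; the numeric values are guesses.
conf.set("spark.streaming.receiver.maxRate", "100000")           // records/sec cap per receiver
conf.set("spark.streaming.backpressure.initialRate", "100000")   // first-batch rate when backpressure is enabled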

Thanks,

0 Answers:

There are no answers yet.