Reading and processing parallelism in Kafka Spark Streaming

Asked: 2017-09-16 05:24:29

Tags: scala apache-spark apache-kafka spark-streaming kafka-consumer-api

I am trying to parallelize reading messages from Kafka so that I can also process them in parallel. My Kafka topic has 10 partitions. I am creating 5 DStreams and applying union to work with a single DStream. Here is the code I have tried so far:

  import scala.util.Random

  import kafka.serializer.{DefaultDecoder, StringDecoder}
  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  def main(args: scala.Array[String]): Unit = {

    val properties = readProperties()

    val streamConf = new SparkConf().setMaster("local[2]").setAppName("KafkaStream")
    val ssc = new StreamingContext(streamConf, Seconds(1))
    // println("defaultParallelism: " + ssc.sparkContext.defaultParallelism)
    ssc.sparkContext.setLogLevel("WARN")

    val numPartitionsOfInputTopic = 5
    val group_id = Random.alphanumeric.take(4).mkString("consumer_group")

    // Create 5 receiver-based streams, union them, and repartition for processing
    val kafkaStream = {
      val kafkaParams = Map(
        "zookeeper.connect"               -> properties.getProperty("zookeeper_connection_str"),
        "group.id"                        -> group_id,
        "zookeeper.connection.timeout.ms" -> "3000")

      val streams = (1 to numPartitionsOfInputTopic).map { _ =>
        KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
          ssc, kafkaParams, Map("kafka_topic" -> 1), StorageLevel.MEMORY_ONLY_SER).map(_._2)
      }

      val unifiedStream = ssc.union(streams)
      val sparkProcessingParallelism = 5
      unifiedStream.repartition(sparkProcessingParallelism)
    }

    kafkaStream.foreachRDD { rdd =>
      rdd.foreach { msg =>
        println("Message: " + msg)
        processMessage(msg)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }

When I run it, it does not receive even a single message, let alone process it any further. Am I missing something here? Please suggest changes if required. Thanks.

1 Answer:

Answer 0 (score: 0)

I would strongly suggest that you switch to the Direct Stream. Why?

By default, the Direct Stream sets its parallelism equal to the number of partitions you have in Kafka. Nothing more needs to be done: just create the Direct Stream and do your work :)

If you create 5 DStreams, you will by default read with 5 threads; one non-direct (receiver-based) DStream = one thread.
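
A minimal sketch of what this could look like with the 0.8 direct API (KafkaUtils.createDirectStream), under the same setup as the question. The broker address is a placeholder, and processMessage is the asker's own helper; adjust both to your environment:

  import kafka.serializer.{DefaultDecoder, StringDecoder}
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val streamConf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectStream")
  val ssc = new StreamingContext(streamConf, Seconds(1))

  // The direct stream talks to the Kafka brokers directly, not to ZooKeeper
  val kafkaParams = Map(
    "metadata.broker.list" -> "localhost:9092",  // placeholder: replace with your broker list
    "group.id"             -> "consumer_group")

  // One stream; Spark creates one RDD partition per Kafka partition, so no union/repartition is needed
  val directStream = KafkaUtils.createDirectStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
    ssc, kafkaParams, Set("kafka_topic"))

  directStream.map(_._2).foreachRDD { rdd =>
    // foreachPartition keeps the per-partition parallelism while processing
    rdd.foreachPartition { messages =>
      messages.foreach { msg =>
        println("Message: " + msg)
        processMessage(msg)
      }
    }
  }

  ssc.start()
  ssc.awaitTermination()

With 10 partitions in the topic, each batch here is processed with up to 10 tasks in parallel without creating multiple streams by hand.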