Kafka Spark directStream cannot fetch data

Time: 2015-08-13 03:24:22

Tags: apache-spark apache-kafka spark-streaming

I am using the Spark directStream API to read data from Kafka. My code is as follows:

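import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// maxRatePerPartition is a Spark setting, so it is set on SparkConf rather than in kafkaParams
val sparkConf = new SparkConf()
    .setAppName("testdirectStreaming")
    .set("spark.streaming.kafka.maxRatePerPartition", "100")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))

val kafkaParams = Map[String, String](
    "auto.offset.reset" -> "smallest",
    "metadata.broker.list" -> "10.0.0.11:9092"
)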
// Start all 3 partitions from offset 0
var fromOffsets:Map[TopicAndPartition, Long] = Map(TopicAndPartition("mytopic",0) -> 0)
fromOffsets+=(TopicAndPartition("mytopic",1) -> 0)
fromOffsets+=(TopicAndPartition("mytopic",2) -> 0)

val kafkaData = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, MessageAndMetadata[String, String]](
ssc, kafkaParams, fromOffsets,(mmd: MessageAndMetadata[String, String]) => mmd)

var offsetRanges = Array[OffsetRange]()
kafkaData.transform { rdd =>
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
}.map {
    _.message()
}.foreachRDD { rdd =>
    for (o <- offsetRanges) {
        println(s"---${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
    }
    rdd.foreachPartition{ partitionOfRecords =>
        partitionOfRecords.foreach { line =>
            println("===============value:"+line)
        }
    }
}

// Nothing is processed until the streaming context is started
ssc.start()
ssc.awaitTermination()

I am sure there is data in the Kafka cluster, but my code cannot fetch any of it. Thanks in advance.

1 answer:

Answer 0: (score: 3)

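I found the reason: the old messages in Kafka had already been deleted because their retention period had expired. So when I set fromOffset to 0, which no longer exists on the broker, it caused an offset-out-of-range exception. That exception caused Spark to reset the offsets to the latest ones, so I could not get any messages. The solution is to set an appropriate fromOffset for each partition, one that still exists on the broker, so the exception does not occur.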
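
As an illustration of "setting an appropriate fromOffset", here is a minimal sketch that clamps each requested starting offset into the range the broker still holds. It is not from the original answer. The names `TopicPartitionKey`, `OffsetClamp` and `clampFromOffsets` are hypothetical, and `TopicPartitionKey` is only a stand-in for `kafka.common.TopicAndPartition` so that the sketch runs without Kafka on the classpath. The earliest and latest offsets are assumed to have been looked up beforehand, for example with Kafka's `kafka.tools.GetOffsetShell` tool (`--time -2` for earliest, `--time -1` for latest).

```scala
// Hypothetical stand-in for kafka.common.TopicAndPartition
final case class TopicPartitionKey(topic: String, partition: Int)

object OffsetClamp {
  // Move each requested offset into [earliest, latest] for its partition.
  // Partitions with no known bounds keep the requested offset unchanged.
  def clampFromOffsets(
      requested: Map[TopicPartitionKey, Long],
      earliest: Map[TopicPartitionKey, Long],
      latest: Map[TopicPartitionKey, Long]): Map[TopicPartitionKey, Long] =
    requested.map { case (tp, offset) =>
      val lowest = earliest.getOrElse(tp, offset)
      val highest = latest.getOrElse(tp, offset)
      tp -> math.min(math.max(offset, lowest), highest)
    }
}
```

With this, a request for offset 0 on a partition whose oldest retained message is at offset 1500 becomes 1500 instead of triggering the exception. The resulting map can then be converted back to `TopicAndPartition` keys and passed as `fromOffsets` to `createDirectStream`.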