Spark Streaming Kafka stream - the RDD returned in foreachRDD does nothing

Time: 2017-05-13 01:39:45

Tags: scala apache-kafka spark-streaming

I'm facing a strange problem here. I am reading Avro records from Kafka and trying to deserialize them and store them in a file. I can fetch the records from Kafka, but when I try to call a function on the RDD records it refuses to do anything.

import java.util.UUID
import io.confluent.kafka.serializers.KafkaAvroDecoder
import com.my.project.avro.AvroDeserializer
import com.my.project.util.SparkJobLogging
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka._
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.dstream.{DStream}

object KafkaConsumer extends SparkJobLogging {
  var schemaRegistry: SchemaRegistryClient = null
  val url = "url:8181"
  schemaRegistry = new CachedSchemaRegistryClient(url, 1000)

  def createKafkaStream(ssc: StreamingContext): DStream[(String,Array[Byte])] = {
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> "zk.server:2181",
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "smallest",
      "bootstrap.servers" -> "bootstrap.server:9092",
      "zookeeper.connection.timeout.ms" -> "6000",
      "schema.registry.url" ->"registry.url:8181"
    )

    val topic = "my.topic"
    KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Set(topic))
  }

  def processRecord(avroStream: Array[Byte]) = {
    println(AvroDeserializer.toRecord(avroStream, schemaRegistry))
  }

  def main(args: Array[String]) = {
    val sparkConf = new SparkConf().setAppName("AvroDeserilizer")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(5))
    val topicStream = createKafkaStream(ssc).map(_._2)
    topicStream.foreachRDD(
      rdd => if (!rdd.isEmpty()){
        logger.info(rdd.count())
        rdd.foreach(avroRecords=> processRecord(avroRecords))
      }
    )
    ssc.start()
    ssc.awaitTermination()
  }
}

import java.nio.ByteBuffer
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

object AvroDeserializer extends SparkJobLogging {
  def toRecord(buffer: Array[Byte], registry: SchemaRegistryClient): GenericRecord = {
    val bb = ByteBuffer.wrap(buffer)
    bb.get()                                // consume MAGIC_BYTE
    val schemaId = bb.getInt                // consume schemaId
    val schema = registry.getByID(schemaId) // look up the writer schema in the Schema Registry
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
    reader.read(null, decoder)              // null -> no reuse datum is provided
  }
}

Up to the logger.info(rdd.count()) statement everything works fine, and I can see the exact record count in the log. But after that nothing happens. When I tried

val record= rdd.first()
processRecord(record)

it works, but neither rdd.foreach(avroRecords => processRecord(avroRecords)) nor rdd.map(avroRecords => processRecord(avroRecords)) does anything. It just prints the following on every streaming batch:

 17/05/14 01:01:24 INFO scheduler.DAGScheduler: Job 2 finished: foreach at KafkaConsumer.scala:56, took 42.684999 s
 17/05/14 01:01:24 INFO scheduler.JobScheduler: Finished job streaming job 1494738000000 ms.0 from job set of time 1494738000000 ms
 17/05/14 01:01:24 INFO scheduler.JobScheduler: Total delay: 84.888 s for time 1494738000000 ms (execution: 84.719 s)
 17/05/14 01:01:24 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
 17/05/14 01:01:24 INFO scheduler.InputInfoTracker: remove old batch metadata: 
 17/05/14 01:01:26 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
 17/05/14 01:01:26 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
 17/05/14 01:01:29 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
 17/05/14 01:01:29 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.

It just prints the last two lines of the log over and over until the next streaming batch is triggered.

3 Answers:

Answer 0 (score: 1)

Your println statements are running on the distributed workers, not in the current (driver) process, so you don't see their output. You can try replacing the println with log.info to verify this.
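
For example (a minimal sketch, not part of the original answer, reusing the question's topicStream, AvroDeserializer and schemaRegistry), you can pull a few records back to the driver with take() so the println output actually shows up in the driver log:

topicStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // take() collects these records to the driver, so this println is visible there
    rdd.take(10).foreach(bytes => println(AvroDeserializer.toRecord(bytes, schemaRegistry)))
  }
}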

Ideally, you should map the DStream[Array[Byte]] into a DStream[GenericRecord] and write that out to files, using .saveAsTextFiles or similar. You may need stream.take(), because the stream can be infinite.
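
A sketch of that suggestion, under assumptions: it reuses the question's topicStream, AvroDeserializer and registry URL, and /data/avro is a made-up output prefix. The registry client is built inside mapPartitions on the executors, so nothing non-serializable has to be shipped from the driver:

val records: DStream[String] = topicStream.mapPartitions { iter =>
  // one client per partition, created on the executor itself
  val registry = new CachedSchemaRegistryClient("registry.url:8181", 1000)
  iter.map(bytes => AvroDeserializer.toRecord(bytes, registry).toString)
}
// writes one directory per batch, named /data/avro-<batch time in ms>
records.saveAsTextFiles("/data/avro")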

http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

Answer 1 (score: 1)

dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.

DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it.
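
Applied to the question's code, the difference looks like this (a minimal sketch, not the poster's code): a transformation alone inside foreachRDD is never evaluated, while an action actually runs, albeit on the executors:

topicStream.foreachRDD { rdd =>
  val mapped = rdd.map(bytes => bytes.length)  // transformation only: never executed, since no action follows
  rdd.foreach(bytes => println(bytes.length))  // action: runs, but prints on the executors
}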

Answer 2 (score: 0)

While the approaches above didn't work for me, I found a different approach in the Confluent documentation. KafkaAvroDecoder talks to the Schema Registry, fetches the schema, and deserializes the data, so a custom deserializer is no longer needed.

import io.confluent.kafka.serializers.KafkaAvroDecoder

val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers,
  "schema.registry.url" -> schemaRegistry,
  "key.converter.schema.registry.url" -> schemaRegistry,
  "value.converter.schema.registry.url" -> schemaRegistry,
  "auto.offset.reset" -> "smallest")
val topicSet = Set(topics)
val messages = KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicSet).map(_._2)
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    logger.info(rdd.count())
    rdd.saveAsTextFile("/data/")
  }
}
ssc.start()
ssc.awaitTermination()

Dependency jar: kafka-avro-serializer-3.1.1.jar. This works perfectly for me, and I hope it helps someone in the future.
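
For reference, a possible build.sbt fragment for pulling that jar (my assumption, not from the answer; Confluent publishes these artifacts in its own Maven repository):

// assumed coordinates for the Confluent Avro serializer used above
resolvers += "confluent" at "https://packages.confluent.io/maven/"
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.1"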