I am facing a strange problem here. I am reading Avro records from Kafka and trying to deserialize them and store them in a file. I can get the records from Kafka, but when I try to apply a function to the RDD records, it refuses to do anything.
import java.util.UUID
import io.confluent.kafka.serializers.KafkaAvroDecoder
import com.my.project.avro.AvroDeserializer
import com.my.project.util.SparkJobLogging
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext, Time}
import org.apache.spark.streaming.kafka._
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.dstream.{DStream}
object KafkaConsumer extends SparkJobLogging {
  var schemaRegistry: SchemaRegistryClient = null
  val url = "url:8181"
  schemaRegistry = new CachedSchemaRegistryClient(url, 1000)

  def createKafkaStream(ssc: StreamingContext): DStream[(String, Array[Byte])] = {
    val kafkaParams = Map[String, String](
      "zookeeper.connect" -> "zk.server:2181",
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "smallest",
      "bootstrap.servers" -> "bootstrap.server:9092",
      "zookeeper.connection.timeout.ms" -> "6000",
      "schema.registry.url" -> "registry.url:8181"
    )
    val topic = "my.topic"
    KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, Set(topic))
  }

  def processRecord(avroStream: Array[Byte]) = {
    println(AvroDeserializer.toRecord(avroStream, schemaRegistry))
  }

  def main(args: Array[String]) = {
    val sparkConf = new SparkConf().setAppName("AvroDeserilizer")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(5))
    val topicStream = createKafkaStream(ssc).map(_._2)
    topicStream.foreachRDD(
      rdd => if (!rdd.isEmpty()) {
        logger.info(rdd.count())
        rdd.foreach(avroRecords => processRecord(avroRecords))
      }
    )
    ssc.start()
    ssc.awaitTermination()
  }
}
object AvroDeserializer extends SparkJobLogging {
  import java.nio.ByteBuffer
  import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
  import org.apache.avro.io.DecoderFactory

  def toRecord(buffer: Array[Byte], registry: SchemaRegistryClient): GenericRecord = {
    val bb = ByteBuffer.wrap(buffer)
    bb.get() // consume MAGIC_BYTE
    val schemaId = bb.getInt // consume schemaId
    val schema = registry.getByID(schemaId) // consult the Schema Registry
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(buffer, bb.position(), bb.remaining(), null)
    reader.read(null, decoder) // null -> as we are not providing any datum
  }
}
Up to the statement logger.info(rdd.count()) everything works fine, and I can see the exact record count in the logs. But nothing happens after that. When I tried
val record = rdd.first()
processRecord(record)
it worked, but rdd.foreach(avroRecords => processRecord(avroRecords)) and rdd.map(avroRecords => processRecord(avroRecords)) do nothing. On every streaming batch it just prints the following:
17/05/14 01:01:24 INFO scheduler.DAGScheduler: Job 2 finished: foreach at KafkaConsumer.scala:56, took 42.684999 s
17/05/14 01:01:24 INFO scheduler.JobScheduler: Finished job streaming job 1494738000000 ms.0 from job set of time 1494738000000 ms
17/05/14 01:01:24 INFO scheduler.JobScheduler: Total delay: 84.888 s for time 1494738000000 ms (execution: 84.719 s)
17/05/14 01:01:24 INFO scheduler.ReceivedBlockTracker: Deleting batches ArrayBuffer()
17/05/14 01:01:24 INFO scheduler.InputInfoTracker: remove old batch metadata:
17/05/14 01:01:26 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
17/05/14 01:01:26 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
17/05/14 01:01:29 INFO yarn.YarnAllocator: Canceling requests for 0 executor containers
17/05/14 01:01:29 WARN yarn.YarnAllocator: Expected to find pending requests, but found none.
It keeps printing only the last two lines of the log until the next streaming context invocation.
Answer 0 (score: 1)
Your println statements are running on the distributed workers, not in the current (driver) process, so you don't see their output. You can verify this by replacing println with log.info and checking the executor logs.
Ideally, you should turn the DStream[Array[Byte]] into a DStream[GenericRecord] and write it out to files, using .saveAsTextFiles or something similar. You may want stream.take(), since the stream may be infinite.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams
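For illustration, a minimal sketch along those lines, reusing the names from the question (createKafkaStream, AvroDeserializer, and the registry URL are assumptions taken from the code above; the records are converted straight to strings for simplicity, and the registry client is rebuilt per partition because it is not serializable):

// Sketch only: turn the raw byte stream into record strings and persist them.
val recordStream: DStream[String] =
  createKafkaStream(ssc)
    .map(_._2)
    .mapPartitions { bytes =>
      // Build the registry client on the executor; the driver-side instance cannot be shipped.
      val registry = new CachedSchemaRegistryClient("registry.url:8181", 1000)
      bytes.map(avro => AvroDeserializer.toRecord(avro, registry).toString)
    }

// saveAsTextFiles is a DStream output operation, so it forces execution of the lazy pipeline.
recordStream.saveAsTextFiles("/data/avro-records")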
Answer 1 (score: 1)
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently. Some of the common mistakes to avoid are as follows.
DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it.
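To make that concrete, here is a hypothetical illustration reusing the topicStream and processRecord names from the question: a foreachRDD body that only builds a transformation never executes, while adding an RDD action forces the work.

// transformation only: lazy, nothing runs
topicStream.foreachRDD { rdd =>
  rdd.map(bytes => processRecord(bytes))
}

// count() is an RDD action, so the map above actually executes
topicStream.foreachRDD { rdd =>
  rdd.map(bytes => processRecord(bytes)).count()
}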
Answer 2 (score: 0)
While the approaches above did not work for me, I found a different approach in the Confluent documentation. KafkaAvroDecoder talks to the Schema Registry, fetches the schema, and deserializes the data, so a custom deserializer is no longer needed.
import io.confluent.kafka.serializers.KafkaAvroDecoder

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> brokers,
  "schema.registry.url" -> schemaRegistry,
  "key.converter.schema.registry.url" -> schemaRegistry,
  "value.converter.schema.registry.url" -> schemaRegistry,
  "auto.offset.reset" -> "smallest")
val topicSet = Set(topics)
val messages = KafkaUtils.createDirectStream[Object, Object, KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicSet).map(_._2)

messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    logger.info(rdd.count())
    rdd.saveAsTextFile("/data/")
  }
}
ssc.start()
ssc.awaitTermination()
}
}
Dependency jar: kafka-avro-serializer-3.1.1.jar. This worked perfectly for me, and I hope it helps someone in the future.
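For reference, one way to pull in that dependency with sbt (a sketch only; it assumes the artifact io.confluent:kafka-avro-serializer:3.1.1 is published in Confluent's Maven repository rather than Maven Central):

// build.sbt sketch (assumption: Confluent's Maven repo hosts this artifact)
resolvers += "confluent" at "http://packages.confluent.io/maven/"
libraryDependencies += "io.confluent" % "kafka-avro-serializer" % "3.1.1"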