Spark Structured Streaming Kafka Avro Producer

Date: 2018-05-24 14:26:10

Tags: apache-spark kafka-producer-api spark-structured-streaming

I have a DataFrame, say:

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

I want to serialize that DataFrame with Avro, using the Schema Registry, and send it to a Kafka topic. I believe I'm almost there, but I can't seem to get past a Task not serializable error. I know Kafka has a sink, but it doesn't talk to the Schema Registry, which is a requirement.

object Holder extends Serializable {
  def prop(): java.util.Properties = {
    val props = new Properties()
    props.put("schema.registry.url", schemaRegistryURL)
    props.put("key.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
    props.put("value.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
    props.put("bootstrap.servers", brokers)
    props
  }

  def vProps(props: java.util.Properties): kafka.utils.VerifiableProperties = {
    val vProps = new kafka.utils.VerifiableProperties(props)
    vProps
  }

  def messageSchema(vProps: kafka.utils.VerifiableProperties): org.apache.avro.Schema = {
    // Fetch the latest value schema for the subject from the Schema Registry
    val avro_schema = new RestService(schemaRegistryURL).getLatestVersion(subjectValueName)
    val messageSchema = new Schema.Parser().parse(avro_schema.getSchema)
    messageSchema
  }

  def avroRecord(messageSchema: org.apache.avro.Schema): org.apache.avro.generic.GenericData.Record = {
    val avroRecord = new GenericData.Record(messageSchema)
    avroRecord
  }

  def ProducerRecord(avroRecord:org.apache.avro.generic.GenericData.Record): org.apache.kafka.clients.producer.ProducerRecord[org.apache.avro.generic.GenericRecord,org.apache.avro.generic.GenericRecord] = {
    val record = new ProducerRecord[GenericRecord, GenericRecord](topicWrite, avroRecord)
    record
  }

  def producer(props: java.util.Properties): KafkaProducer[GenericRecord, GenericRecord] = {
    val producer = new KafkaProducer[GenericRecord, GenericRecord](props)
    producer
  }
}

val prod: (String, String) => String = (
  number: String,
  word: String
) => {
  val prop = Holder.prop()
  val vProps = Holder.vProps(prop)
  val mSchema = Holder.messageSchema(vProps)
  val aRecord = Holder.avroRecord(mSchema)
  aRecord.put("number", number)
  aRecord.put("word", word)
  val record = Holder.ProducerRecord(aRecord)
  val producer = Holder.producer(prop)
  producer.send(record)
  "sent"
}

val prodUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
  udf((
    number: String,
    word: String
  ) => prod(number, word))


val testDF = someDF.withColumn("sent", prodUDF(col("number"), col("word")))

1 Answer:

Answer 0 (score: 0)

KafkaProducer cannot be serialized. Create the KafkaProducer inside prod() rather than outside it, so it is instantiated on the executor instead of being captured in the UDF's closure.
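A minimal sketch of that idea: hold the producer and schema in an object with `lazy val`s, so each executor JVM initializes its own instances and nothing non-serializable is captured in the closure. This assumes the same `schemaRegistryURL`, `brokers`, `topicWrite`, `subjectValueName`, and `someDF` values from the question are in scope, and it is untested without a running broker and registry:

```scala
import java.util.Properties

import io.confluent.kafka.schemaregistry.client.rest.RestService
import io.confluent.kafka.serializers.KafkaAvroSerializer
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.functions.{col, udf}

// Initialized lazily on each executor JVM; the UDF closure only captures
// the object reference, never the producer itself.
object KafkaSink {
  lazy val producer: KafkaProducer[String, GenericRecord] = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("schema.registry.url", schemaRegistryURL)
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", classOf[KafkaAvroSerializer].getCanonicalName)
    new KafkaProducer[String, GenericRecord](props)
  }

  lazy val schema: Schema = {
    val meta = new RestService(schemaRegistryURL).getLatestVersion(subjectValueName)
    new Schema.Parser().parse(meta.getSchema)
  }
}

val prodUDF = udf { (number: String, word: String) =>
  val record = new GenericData.Record(KafkaSink.schema)
  record.put("number", number)
  record.put("word", word)
  KafkaSink.producer.send(
    new ProducerRecord[String, GenericRecord](topicWrite, record))
  "sent"
}

val testDF = someDF.withColumn(
  "sent", prodUDF(col("number").cast("string"), col("word")))
```

Note that `send()` is asynchronous; for at-least-once guarantees you would also want to flush or check the returned futures, e.g. in a `foreachPartition` rather than a UDF.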