Converting Spark DataFrame Rows to Avro and publishing to Kafka

Date: 2018-03-22 19:24:57

Tags: apache-spark dataframe apache-kafka spark-dataframe spark-streaming

I have a Spark DataFrame with the schema below, and I am trying to publish this DataFrame to Kafka using Avro.

```
root
 |-- clientTag: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- contactPoint: struct (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- performCheck: string (nullable = true)
```

Sample record: `{"performCheck" : "N", "clientTag" :{"key":"value"}, "contactPoint": {"email":"abc@gmail.com", "type":"EML"}}`

Avro schema:

{ "name":"Message", "namespace":"kafka.sample.avro", "type":"record", "fields":[ {"type":"string", "name":"id"}, {"type":"string", "name":"email"} {"type":"string", "name":"type"} ] }

I have a couple of questions.

  1. What is the best way to convert an org.apache.spark.sql.Row into an Avro message? I want to extract `email` and `type` from each Row of the DataFrame and build the Avro message from those values.
  2. Eventually, all the Avro messages will be sent to Kafka. If anything fails during production, how can I collect every Row that failed to be produced to Kafka and return them as a DataFrame?
  3. Thanks for any help.

1 Answer:

Answer 0 (score: 0)

You can try this.

Question 1: You can extract the sub-elements of the DataFrame using dot notation:

```scala
val dfJSON = spark.read.json("/json/path/sample_avro_data_as_json.json") // could also come from a schema registry
  .withColumn("id", $"clientTag.key")
  .withColumn("email", $"contactPoint.email")
  .withColumn("type", $"contactPoint.type")
```

Then you can use these columns directly when assigning values to the Avro record that you serialize and send to Kafka.

Question 2: You can track success and failure like this. It is not fully working code, but it should give you an idea.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroSerializer
import scala.collection.JavaConverters._

dfJSON.foreachPartition( currentPartition => {

  // props (producer config), kafkaTopic and schemaRegistryUrl are assumed to be defined
  val producer = new KafkaProducer[String, Array[Byte]](props)
  val schema: Schema = ... // Get schema from schema registry or avsc file
  val schemaRegProps = Map("schema.registry.url" -> schemaRegistryUrl)
  val client = new CachedSchemaRegistryClient(schemaRegistryUrl, Int.MaxValue)
  val valueSerializer = new KafkaAvroSerializer(client)
  valueSerializer.configure(schemaRegProps.asJava, false) // false = configure as a value serializer

  val sendResults = currentPartition.map( rec => {
    try {
      // Build the Avro record from the flattened columns
      val avroRecord: GenericRecord = new GenericData.Record(schema)
      avroRecord.put("id", rec.getAs[String]("id"))
      avroRecord.put("email", rec.getAs[String]("email"))
      avroRecord.put("type", rec.getAs[String]("type"))

      // Serialize the record and send it to Kafka, keyed by id
      producer.send(new ProducerRecord[String, Array[Byte]](
        kafkaTopic, rec.getAs[String]("id"), valueSerializer.serialize(kafkaTopic, avroRecord)))
      (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Success")
    } catch {
      case e: Exception =>
        println("*** Exception *** ")
        e.printStackTrace()
        (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Failed")
    }
  })

  // map on an Iterator is lazy; foreach forces the sends to actually happen
  sendResults.foreach(println)
  // You can retry or log the failed tuples here
  producer.close()
})
```
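
The schema placeholder above could be filled by parsing the avsc from the question with Avro's `Schema.Parser`; a sketch, inlining that schema as a string:

```scala
// One way to obtain the Schema: parse the avsc JSON from the question
val schemaString =
  """{ "name":"Message", "namespace":"kafka.sample.avro", "type":"record",
    |  "fields":[ {"type":"string", "name":"id"},
    |             {"type":"string", "name":"email"},
    |             {"type":"string", "name":"type"} ] }""".stripMargin
val schema: Schema = new Schema.Parser().parse(schemaString)
```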

The output is:

(111,abc@gmail.com,EML,Success)

You can then do whatever you want with those tuples, e.g. retry or log them.
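
To actually get the failed Rows back as a DataFrame (the second question), one option is to replace `foreachPartition` with `mapPartitions`, so the per-row status comes back as a Dataset that can be filtered. A minimal sketch, assuming the same producer/schema/serializer setup as in the snippet above:

```scala
import spark.implicits._

// Sketch: same per-partition logic, but returning a status tuple per row
val statusDS = dfJSON.mapPartitions { partition =>
  // ... create producer, schema and valueSerializer here, as above ...
  partition.map { rec =>
    val (id, email, typ) =
      (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"))
    try {
      // build the GenericRecord and call producer.send(...) as above
      (id, email, typ, "Success")
    } catch {
      case e: Exception => (id, email, typ, "Failed")
    }
  }
}

// Keep only the rows that failed to reach Kafka, as a DataFrame
val failedDF = statusDS.toDF("id", "email", "type", "sent_status")
  .filter($"sent_status" === "Failed")
```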