I have a Spark DataFrame with the following schema, and I am trying to publish this DataFrame to Kafka using Avro:
```
root
 |-- clientTag: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- contactPoint: struct (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- type: string (nullable = true)
 |-- performCheck: string (nullable = true)
```
Sample record:
```
{"performCheck" : "N", "clientTag" :{"key":"value"}, "contactPoint": {"email":"abc@gmail.com", "type":"EML"}}
```
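For reference, a minimal sketch of how such records might be loaded into a DataFrame (assuming a running SparkSession named `spark`; the JSON path is the same placeholder used in the answer below):
```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DfToKafkaAvro").getOrCreate()

// Load the sample record(s); printSchema() should show the struct layout above
val df = spark.read.json("/json/path/sample_avro_data_as_json.json")
df.printSchema()
```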
Avro schema:
```
{
  "name": "Message",
  "namespace": "kafka.sample.avro",
  "type": "record",
  "fields": [
    {"type": "string", "name": "id"},
    {"type": "string", "name": "email"},
    {"type": "string", "name": "type"}
  ]
}
```
I have a couple of questions. What is the best way to convert an org.apache.spark.sql.Row into an Avro message, given that I want to extract email and type from each row of the DataFrame and build the Avro message from those values? Thanks for your help.
Answer 0 (score: 0):
You can try this.
Question 1: You can extract the sub-elements of a DataFrame using dot notation:
```
val dfJSON = spark.read.json("/json/path/sample_avro_data_as_json.json") // can read from schema registry
  .withColumn("id", $"clientTag.key")
  .withColumn("email", $"contactPoint.email")
  .withColumn("type", $"contactPoint.type")
```
You can then use these columns directly when assigning values to the Avro record that you serialize and send to Kafka.
Question 2: You can track successes and failures like this. This is not fully working code, but it should give you an idea.
```
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.KafkaAvroSerializer
import scala.collection.JavaConverters._

dfJSON.foreachPartition( currentPartition => {
  // `props` holds the usual Kafka producer config (bootstrap.servers, serializers, ...)
  val producer = new KafkaProducer[String, Array[Byte]](props)
  val schema: Schema = ??? // get the schema from the Schema Registry or an .avsc file
  val schemaRegProps = Map("schema.registry.url" -> schemaRegistryUrl)
  val client = new CachedSchemaRegistryClient(schemaRegistryUrl, Int.MaxValue)
  val valueSerializer = new KafkaAvroSerializer(client)
  valueSerializer.configure(schemaRegProps.asJava, false) // false = configuring a value serializer

  val sendStatuses = currentPartition.map(rec => {
    try {
      val avroRecord: GenericRecord = new GenericData.Record(schema)
      avroRecord.put("id", rec.getAs[String]("id"))
      avroRecord.put("email", rec.getAs[String]("email"))
      avroRecord.put("type", rec.getAs[String]("type"))
      // Serialize the record and send it to Kafka, keyed by id
      producer.send(new ProducerRecord[String, Array[Byte]](
        kafkaTopic, rec.getAs[String]("id"), valueSerializer.serialize(kafkaTopic, avroRecord)))
      (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Success")
    } catch {
      case e: Exception =>
        println("*** Exception ***")
        e.printStackTrace()
        (rec.getAs[String]("id"), rec.getAs[String]("email"), rec.getAs[String]("type"), "Failed")
    }
  })
  // Forces the lazy iterator; switch to mapPartitions + toDF("id", "email", "type", "sent_status")
  // on the driver side if you want the statuses back as a DataFrame
  sendStatuses.foreach(println)
  producer.close()
})
```
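The snippet above assumes props, schema, schemaRegistryUrl, and kafkaTopic are already in scope. A minimal sketch of what those definitions might look like (broker address, registry URL, topic name, and .avsc path are all placeholders):
```
import java.util.Properties
import org.apache.avro.Schema

val schemaRegistryUrl = "http://localhost:8081"  // placeholder registry URL
val kafkaTopic = "sample-topic"                  // placeholder topic name

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

// Parsing the Avro schema from an .avsc file instead of fetching it from the registry
val schema: Schema = new Schema.Parser().parse(new java.io.File("/path/to/Message.avsc"))
```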
The output looks like:
```
(111,abc@gmail.com,EML,Success)
```
You can then do whatever you want with these statuses.
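For example, if you want to act on the failures rather than just print every status, one sketch (replacing the sendStatuses.foreach(println) line above, still inside foreachPartition):
```
// Materialize the per-partition statuses, split them, and handle the failures
val statuses = sendStatuses.toList
val (succeeded, failed) = statuses.partition(_._4 == "Success")
failed.foreach { case (id, email, typ, _) =>
  // log, write to a dead-letter topic, or re-enqueue for a retry
  println(s"FAILED id=$id email=$email type=$typ")
}
```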