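I have an Avro file produced by Event Hub Capture. It has a body property whose string content is serialized in binary form, and I want to decode it to a string so that I can then parse it as JSON.
Any help?
Thanks.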
Answer 0: (score: 0)
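Once you have read the message from Event Hub, I assume you have a GenericRecord. GenericRecord's .get returns an AnyRef, which you can cast to Array[Byte]. That byte array can then be used to instantiate the JSON string.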
val bodyBytes = record.get("body").asInstanceOf[Array[Byte]]
val body = new String(bodyBytes, java.nio.charset.StandardCharsets.UTF_8) // decode explicitly as UTF-8 instead of the platform default
Your body is now a JSON string, which you can deserialize into a Map or another type.
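If the record came out of Avro's generic reader rather than being built by hand, a bytes field is typically returned as a java.nio.ByteBuffer instead of an Array[Byte], in which case a direct cast to Array[Byte] would fail. A minimal sketch that accepts either form follows. BodyDecoder and decodeBody are hypothetical names, not part of Avro, and UTF-8 is assumed for the payload. Avro field names are also case-sensitive, and the other answer refers to the column as Body, so check which spelling your schema uses.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical helper (not an Avro API): turns whatever record.get(...)
// returned for the body field into a String, assuming UTF-8 content.
object BodyDecoder {
  def decodeBody(value: Any): String = value match {
    case null =>
      throw new IllegalArgumentException("body field is null")
    case bytes: Array[Byte] =>
      new String(bytes, StandardCharsets.UTF_8)
    case buf: ByteBuffer =>
      // duplicate() shares the content but has its own position,
      // so the caller's buffer is left untouched
      val copy = buf.duplicate()
      val bytes = new Array[Byte](copy.remaining())
      copy.get(bytes)
      new String(bytes, StandardCharsets.UTF_8)
    case other =>
      throw new IllegalArgumentException(
        s"Unexpected body type: ${other.getClass.getName}")
  }
}
```

With such a helper, the decoding step could be written as val body = BodyDecoder.decodeBody(record.get("body")).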
Here is a test I wrote:
import org.scalatest.{FlatSpec, Matchers}

class AvroSpec extends FlatSpec with Matchers {
  import java.nio.charset.StandardCharsets.UTF_8
  import org.apache.avro.Schema
  import org.apache.avro.generic.GenericData
  import org.json4s._
  import org.json4s.jackson.JsonMethods._

  // The body field holds raw bytes, so it is declared as "bytes"
  val schemaString =
    """
    {
      "type": "record",
      "namespace": "com.example",
      "name": "EventMessage",
      "fields": [
        { "name": "body", "type": "bytes" }
      ]
    }
    """
  val schema: Schema = new Schema.Parser().parse(schemaString)

  // Build a record by hand; put() does not validate the value against the schema
  val record = new GenericData.Record(schema)
  record.put("body",
    """
    {
      "user": "Ankit Gupt",
      "languages": ["scala", "java", "js"]
    }
    """.getBytes(UTF_8))

  "Avro" should "decode bytes and parse json" in {
    // get the field, cast it to a byte array and decode it as UTF-8
    val bodyBytes = record.get("body").asInstanceOf[Array[Byte]]
    val bodyStr = new String(bodyBytes, UTF_8)

    // parse the JSON and extract it; json4s needs implicit formats in scope
    implicit val formats: Formats = DefaultFormats
    val bodyMap = parse(bodyStr).extract[Map[String, Any]]
    bodyMap("user") should equal("Ankit Gupt")
  }
}
Answer 1: (score: 0)
// Parse the Body of the Avro file.
// Assumes df is the DataFrame loaded from the Avro file and spark is the active SparkSession.
import org.apache.spark.sql.functions.col
// 1. Select the Body column, which is in binary format
// 2. Cast the Body column to string
// 3. Use the RDD's map function to convert each Row to a String
val bodyRDD = df.select(col("Body").cast("string")).rdd.map(row => row.getString(0))
// Read the resulting strings as JSON
val data = spark.read.json(bodyRDD)