I have a schema with a single field of type map:
val jsonSchema =
s"""
|{"type": "record",
| "name": "MyEvent",
| "namespace": "my.event",
| "fields": [
| {"name": "fields",
| "doc": "The event field values",
| "type": {"type": "map", "values": ["int", "long", "float", "double", "string", "boolean"]}
| }]
| }""".stripMargin
When I try to deserialize an event with from_avro using the schema above, I get the following output:
+----------------------------------------------------+
|event |
+----------------------------------------------------+
|[[a -> [1,,,,,], b -> [,,,, r,], c -> [,,,,, true]]]|
+----------------------------------------------------+
How can I convert this structure into a simple Map[String, Any]?
Steps to reproduce:
1) I create a record of the above type:
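import java.io.ByteArrayOutputStream
import scala.collection.JavaConverters._
import org.apache.avro.Schema.Parser
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

val parser = new Parser()
val schema = parser.parse(jsonSchema)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.binaryEncoder(out, null)
val map = Map("a" -> 1, "b" -> "r", "c" -> true).asJava //Map has different types of values
val record = new GenericData.Record(schema)
record.put("fields", map)
// Write the record in Avro binary encoding
val datumWriter = new GenericDatumWriter[GenericRecord](schema)
datumWriter.write(record, encoder)
encoder.flush()
out.close()
val serializedBytes = out.toByteArray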
2) Now add it to a DataFrame:
import spark.implicits._
val df = spark.sparkContext.parallelize(List(serializedBytes)).toDF()
**df.printSchema()**
root
|-- value: binary (nullable = true)
3) Try to deserialize it with from_avro:
val dfDeser = df.withColumn("event", from_avro(df("value"), jsonSchema))
**dfDeser.printSchema()**
root
|-- value: binary (nullable = true)
|-- event: struct (nullable = true)
| |-- fields: map (nullable = false)
| | |-- key: string
| | |-- value: struct (valueContainsNull = false)
| | | |-- member0: integer (nullable = true)
| | | |-- member1: long (nullable = true)
| | | |-- member2: float (nullable = true)
| | | |-- member3: double (nullable = true)
| | | |-- member4: string (nullable = true)
| | | |-- member5: boolean (nullable = true)
**dfDeser.show(10,false)**
+----------------------------------------------------+
|event |
+----------------------------------------------------+
|[[a -> [1,,,,,], b -> [,,,, r,], c -> [,,,,, true]]]|
+----------------------------------------------------+
Edit 1: I am using the following code to convert it into a simple Map:
import org.apache.spark.sql.Row

dfDeser.select("event").map(row => {
  val eventRow = row.getAs[Row]("event")
  val fieldsMap = eventRow.getAs[Map[String, Row]]("fields")
  // Each value is a struct with one slot per union branch; keep the first non-null slot as a string
  val simpleFieldMap = fieldsMap.mapValues { unionRow =>
    val values = for (i <- 0 until unionRow.length) yield Option(unionRow.get(i)).map(_.toString)
    values.flatten.headOption.getOrElse("")
  }
  simpleFieldMap
}).show(10,false)
+---------------------------+
|value |
+---------------------------+
|[a -> 1, b -> r, c -> true]|
+---------------------------+
Is there a simpler way to achieve this?
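To make the goal concrete, here is a minimal sketch of the collapsing step in plain Scala, without Spark, so it can be run on its own. It models each map value as the sequence of member0..member5 slots seen in the schema above and keeps the one slot that is not null. The name `collapseUnion` and the `fields` sample data are made up for illustration; they are not part of Spark or Avro.

```scala
// Hypothetical helper: return the single populated branch of a union struct.
// `members` stands in for the member0..member5 slots of one map value.
def collapseUnion(members: Seq[Any]): Option[Any] =
  members.find(_ != null)

// Plain-Scala model of the deserialized `fields` map (sample data, assumed).
val fields: Map[String, Seq[Any]] = Map(
  "a" -> Seq(1, null, null, null, null, null),
  "b" -> Seq(null, null, null, null, "r", null),
  "c" -> Seq(null, null, null, null, null, true)
)

// Drop keys whose union has no populated branch; keep the original value types.
val simple: Map[String, Any] =
  fields.flatMap { case (k, members) => collapseUnion(members).map(k -> _) }
```

Inside the Spark code from Edit 1, the same idea would presumably be applied to `unionRow.toSeq`. Note that Spark has no built-in encoder for `Any`, so within a Dataset the values would still have to be converted to one common type, such as String, which is what Edit 1 already does.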