Strange Spark deserialization behavior with from_avro

Time: 2020-05-20 17:08:02

Tags: avro spark-avro

I have a schema with a single field, of type map, whose values are a union of several primitive types:

val jsonSchema =
    s"""
       |{"type": "record",
       | "name": "MyEvent",
       | "namespace": "my.event",
       | "fields": [
       |  {"name": "fields",
       |  "doc": "The event field values",
       |  "type": {"type": "map", "values": ["int", "long", "float", "double", "string", "boolean"]}
       |  }]
       |  }""".stripMargin

When I deserialize an event with from_avro using the schema above, I get the following output:

+----------------------------------------------------+
|event                                               |
+----------------------------------------------------+
|[[a -> [1,,,,,], b -> [,,,, r,], c -> [,,,,, true]]]|
+----------------------------------------------------+

How can I convert this structure into a simple Map[String, Any]?

Steps to reproduce:

1) I create a record of the above type:

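import java.io.ByteArrayOutputStream
import org.apache.avro.Schema.Parser
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory
import scala.collection.JavaConverters._

val parser = new Parser()
val schema = parser.parse(jsonSchema)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get.binaryEncoder(out, null)

val map = Map("a" -> 1, "b" -> "r", "c" -> true).asJava // the map holds values of different types
val record = new GenericData.Record(schema)
record.put("fields", map)
val datumWriter = new GenericDatumWriter[GenericRecord](schema)

// Serialize the record to Avro binary
datumWriter.write(record, encoder)
encoder.flush()
out.close()
val serializedBytes = out.toByteArray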

2) Now I add it to a DataFrame:

import spark.implicits._
val df = spark.sparkContext.parallelize(List(serializedBytes)).toDF()
df.printSchema()
root
 |-- value: binary (nullable = true)

3) I try to deserialize it using from_avro:

val dfDeser = df.withColumn("event", from_avro(df("value"), jsonSchema))
dfDeser.printSchema()
root
 |-- value: binary (nullable = true)
 |-- event: struct (nullable = true)
 |    |-- fields: map (nullable = false)
 |    |    |-- key: string
 |    |    |-- value: struct (valueContainsNull = false)
 |    |    |    |-- member0: integer (nullable = true)
 |    |    |    |-- member1: long (nullable = true)
 |    |    |    |-- member2: float (nullable = true)
 |    |    |    |-- member3: double (nullable = true)
 |    |    |    |-- member4: string (nullable = true)
 |    |    |    |-- member5: boolean (nullable = true)

dfDeser.show(10,false)
+----------------------------------------------------+
|event                                               |
+----------------------------------------------------+
|[[a -> [1,,,,,], b -> [,,,, r,], c -> [,,,,, true]]]|
+----------------------------------------------------+
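
For reading the output above: Spark's Avro support maps a union of several non-null types to a struct whose fields member0, member1, ... follow the order of the union branches, and only the slot matching the actual value is non-null. The sketch below models that layout in plain Scala, without Spark. The names `UnionSlots`, `branchIndex`, `toSlots`, and `fromSlots` are hypothetical and exist only for illustration; this is not Spark's implementation.

```scala
// Hypothetical model of the six-slot struct shown above; not Spark's implementation.
object UnionSlots {
  // Union branches in the order declared in the schema: member0 .. member5.
  val branches: Vector[String] =
    Vector("int", "long", "float", "double", "string", "boolean")

  // Index of the branch that a plain value would fall into.
  def branchIndex(value: Any): Int = value match {
    case _: Int     => 0
    case _: Long    => 1
    case _: Float   => 2
    case _: Double  => 3
    case _: String  => 4
    case _: Boolean => 5
    case other      => throw new IllegalArgumentException(s"unsupported value: $other")
  }

  // Build the struct-like row: null everywhere except the matching slot.
  def toSlots(value: Any): Vector[Any] = {
    val hit = branchIndex(value)
    branches.indices.map(i => if (i == hit) value else null).toVector
  }

  // Recover the original value: the first non-null slot, if any.
  def fromSlots(slots: Seq[Any]): Option[Any] = slots.find(_ != null)
}
```

With a Spark `Row`, `UnionSlots.fromSlots(row.toSeq)` would recover the value with its original type instead of a string.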

Edit 1: I am currently using the following code to convert it to a simple Map:

dfDeser.select("event").map(row => {
  val eventRow = row.getAs[Row]("event")
  val fieldsMap = eventRow.getAs[Map[String, Row]]("fields")

  // Each value is a struct with one slot per union branch;
  // keep the first non-null slot, converted to a string.
  val simpleFieldMap = fieldsMap.mapValues { valueRow =>
    val values = for (i <- 0 until valueRow.length) yield Option(valueRow.get(i)).map(_.toString)
    values.flatten.headOption.getOrElse("")
  }
  simpleFieldMap
}).show(10,false)

+---------------------------+
|value                      |
+---------------------------+
|[a -> 1, b -> r, c -> true]|
+---------------------------+

Is there a simpler way to achieve this?
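
One possible direction, offered as a sketch rather than a verified answer: Spark SQL's higher-order functions can pick the non-null member of each map value without dropping down to a row-level map. The helper below only builds the SQL expression string, so it runs without Spark. `FlattenUnionMap` and `flattenUnionMapExpr` are hypothetical names. It assumes Spark 2.4 or later (for `transform` and `map_from_arrays`) and that converting every value to a string is acceptable, as in the code above. A true Map[String, Any] column is not possible, because a Spark map column has a single value type.

```scala
object FlattenUnionMap {
  // Builds a Spark SQL expression that turns map<string, struct<member0..memberN>>
  // into map<string, string> by keeping the first non-null member of each value.
  def flattenUnionMapExpr(mapColumn: String, memberCount: Int): String = {
    require(memberCount > 0, "memberCount must be positive")
    val casts = (0 until memberCount)
      .map(i => s"cast(v.member$i as string)")
      .mkString(", ")
    s"map_from_arrays(map_keys($mapColumn), " +
      s"transform(map_values($mapColumn), v -> coalesce($casts)))"
  }
}

// Intended use with the DataFrame from step 3 (requires Spark on the classpath):
//   import org.apache.spark.sql.functions.expr
//   dfDeser.select(expr(FlattenUnionMap.flattenUnionMapExpr("event.fields", 6)).as("fields"))
```

On Spark 3.0 or later, `transform_values` may shorten the expression further, since it avoids splitting the map into key and value arrays.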

0 answers:

No answers yet