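Exploding nested struct keys into column values

Time: 2019-08-12 18:12:12

Tags: scala dataframe apache-spark

In the JSON objects I need to process, I am given a nested StructType in which each key represents a specific location and holds a currency and a price: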

 |-- id: string (nullable = true)
 |-- pricingByCountry: struct (nullable = true)
 |    |-- regionPrices: struct (nullable = true)
 |    |    |-- AT: struct (nullable = true)
 |    |    |    |-- currency: string (nullable = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |-- BT: struct (nullable = true)
 |    |    |    |-- currency: string (nullable = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |-- CL: struct (nullable = true)
 |    |    |    |-- currency: string (nullable = true)
 |    |    |    |-- price: double (nullable = true)
...etc.

I want to explode it so that I get one row per country instead of one column per country:

+---+--------+---------+------+
| id| country| currency| price|
+---+--------+---------+------+
|  0|      AT|      EUR|   100|
|  0|      BT|      NGU|   400|
|  0|      CL|      PES|   200|
+---+--------+---------+------+

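These solutions make intuitive sense: "Spark DataFrame exploding a map with the key as a member" and "Spark scala - Nested StructType conversion to Map". Unfortunately, they don't work for me, because I am passing in a column rather than a whole row to be mapped. I don't want to map the whole row by hand, only the specific column that contains the nested struct. There are several other attributes at the same level as "id" that I want to keep in the structure.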

1 answer:

Answer 0 (score: 0)

I think it can be done as follows:

// col, explode, to_json and udf live in functions; ArrayType etc. live in types
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._ // already in scope in spark-shell; needed for .toDS elsewhere

// JSON test data
val ds = Seq("""{"id":"abcd","pricingByCountry":{"regionPrices":{"AT":{"currency":"EUR","price":100.00},"BT":{"currency":"NGE","price":200.00},"CL":{"currency":"PES","price":300.00}}}}""").toDS

val df = spark.read.json(ds)

// Schema of the UDF output: one (country, currency, price) struct per country
val outputSchema = ArrayType(StructType(Seq(
  StructField("country", StringType, false),
  StructField("currency", StringType, false),
  StructField("price", DoubleType, false)
)))

// The UDF takes the `regionPrices` JSON string and converts
// it to a sequence of (country, currency, price) tuples
val toMap = udf((jsonString: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule

  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)

  // Inner values are mixed (string currency, numeric price), so read them as Any
  val jsonMap = jsonMapper.readValue(jsonString, classOf[Map[String, Map[String, Any]]])
  jsonMap.map(f => (f._1, f._2("currency").toString, f._2("price").asInstanceOf[Number].doubleValue)).toSeq

}, outputSchema)

val result = df.
              select(col("id").as("id"), explode(toMap(to_json(col("pricingByCountry.regionPrices")))).as("temp")).
              select(col("id"), col("temp.country").as("country"), col("temp.currency").as("currency"), col("temp.price").as("price"))

The output will be:

scala> result.show
+----+-------+--------+-----+
|  id|country|currency|price|
+----+-------+--------+-----+
|abcd|     AT|     EUR|100.0|
|abcd|     BT|     NGE|200.0|
|abcd|     CL|     PES|300.0|
+----+-------+--------+-----+
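
As a possible alternative, the same reshaping can be sketched with built-in column functions only, with no UDF or Jackson round trip. This also sidesteps the `udf(f, dataType)` overload, which has been deprecated since Spark 3.0. The idea is to read the country codes from the nested struct's field names, build one `struct(country, currency, price)` per code, wrap them in an `array`, and `explode` that array. The names `prefix`, `countries`, `perCountry` and `exploded`, and the extra top-level `name` field in the test data, are my own assumptions for illustration. The sketch also assumes every country struct has the same field types, since `array` needs matching element types.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

// Local session for experimenting; inside spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("explode-struct-keys").getOrCreate()

// Same test data as in the answer, plus a hypothetical top-level "name" field
// to show that sibling columns of "id" are kept.
val json = """{"id":"abcd","name":"widget","pricingByCountry":{"regionPrices":{"AT":{"currency":"EUR","price":100.00},"BT":{"currency":"NGE","price":200.00},"CL":{"currency":"PES","price":300.00}}}}"""
val df = spark.read.json(spark.createDataset(Seq(json))(Encoders.STRING))

val prefix = "pricingByCountry.regionPrices"

// The country codes are simply the field names of the nested struct.
val countries = df.select(s"$prefix.*").columns

// One struct(country, currency, price) column expression per country code.
val perCountry = countries.map { c =>
  struct(
    lit(c).as("country"),
    col(s"$prefix.$c.currency").as("currency"),
    col(s"$prefix.$c.price").as("price"))
}

// Explode the array of structs into rows, keep every other top-level column,
// then flatten the temporary struct into plain columns.
val exploded = df
  .withColumn("temp", explode(array(perCountry: _*)))
  .drop("pricingByCountry")
  .select(col("*"), col("temp.*"))
  .drop("temp")

exploded.show()
```

If a country is missing for a given id, this sketch still emits a row for it with null `currency` and `price`; adding something like `.where(col("price").isNotNull)` would drop those rows.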