In the JSON objects I need to process, I am given a nested StructType in which each key represents a specific location and holds a currency and a price:
 |-- id: string (nullable = true)
 |-- pricingByCountry: struct (nullable = true)
 |    |-- regionPrices: struct (nullable = true)
 |    |    |-- AT: struct (nullable = true)
 |    |    |    |-- currency: string (nullable = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |-- BT: struct (nullable = true)
 |    |    |    |-- currency: string (nullable = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |-- CL: struct (nullable = true)
 |    |    |    |-- currency: string (nullable = true)
 |    |    |    |-- price: double (nullable = true)
...etc.
I want to explode it so that I get one row per country instead of one column per country:
+---+--------+---------+------+
| id| country| currency| price|
+---+--------+---------+------+
|  0|      AT|      EUR|   100|
|  0|      BT|      NGU|   400|
|  0|      CL|      PES|   200|
+---+--------+---------+------+
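These solutions make intuitive sense: "Spark DataFrame exploding a map with the key as a member" and "Spark scala - Nested StructType conversion to Map". Unfortunately they do not work for me, because I am passing in a column rather than an entire row to be mapped. I don't want to map the whole row by hand, only the specific column that contains the nested struct. There are several other attributes at the same level as "id" that I want to keep in the structure.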
Answer 0 (score: 0)
I think it can be done as follows:
// Imports needed outside spark-shell (toDS, col/explode/to_json/udf, schema types)
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// JSON test data
val ds = Seq("""{"id":"abcd","pricingByCountry":{"regionPrices":{"AT":{"currency":"EUR","price":100.00},"BT":{"currency":"NGE","price":200.00},"CL":{"currency":"PES","price":300.00}}}}""").toDS
val df = spark.read.json(ds)

// Schema that describes the UDF output
val outputSchema = ArrayType(StructType(Seq(
  StructField("country", StringType, false),
  StructField("currency", StringType, false),
  StructField("price", DoubleType, false)
)))

// UDF takes the `regionPrices` value as a JSON string and converts
// it to a Seq of (country, currency, price) tuples.
// Note: the untyped udf(f, dataType) overload is deprecated since Spark 3.0
// and requires spark.sql.legacy.allowUntypedScalaUDF=true there.
val toMap = udf((jsonString: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule
  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)
  // Inner values are mixed (String currency, numeric price), hence Any
  val jsonMap = jsonMapper.readValue(jsonString, classOf[Map[String, Map[String, Any]]])
  jsonMap.map { case (country, v) =>
    (country, v("currency").toString, v("price").asInstanceOf[Number].doubleValue)
  }.toSeq
}, outputSchema)

val result = df.
  select(col("id").as("id"), explode(toMap(to_json(col("pricingByCountry.regionPrices")))).as("temp")).
  select(col("id"), col("temp.country").as("country"), col("temp.currency").as("currency"), col("temp.price").as("price"))
The output will be:
scala> result.show
+----+-------+--------+-----+
|  id|country|currency|price|
+----+-------+--------+-----+
|abcd|     AT|     EUR|100.0|
|abcd|     BT|     NGE|200.0|
|abcd|     CL|     PES|300.0|
+----+-------+--------+-----+
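If a UDF and the JSON round trip are not wanted, the same reshaping can probably be done with built-in column functions only. The following is a minimal sketch, not a tested production solution. It assumes that the country codes can be read from the DataFrame schema, that every country struct has exactly the fields `currency` and `price`, and that a local SparkSession is acceptable for trying it out. The names `countries` and `perCountry` are illustrative and do not come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for experimenting; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("explode-struct").getOrCreate()
import spark.implicits._

// Same JSON test data as in the answer above
val ds = Seq("""{"id":"abcd","pricingByCountry":{"regionPrices":{"AT":{"currency":"EUR","price":100.00},"BT":{"currency":"NGE","price":200.00},"CL":{"currency":"PES","price":300.00}}}}""").toDS
val df = spark.read.json(ds)

// Country codes are the field names of the nested `regionPrices` struct
val countries = df.select("pricingByCountry.regionPrices.*").columns

// Build one struct(country, currency, price) column per country code
val perCountry = countries.map { c =>
  struct(
    lit(c).as("country"),
    col(s"pricingByCountry.regionPrices.$c.currency").as("currency"),
    col(s"pricingByCountry.regionPrices.$c.price").as("price"))
}

// Collect the structs into an array and explode it: one output row per country
val result = df
  .select(col("id"), explode(array(perCountry: _*)).as("temp"))
  .select("id", "temp.country", "temp.currency", "temp.price")

result.show()
```

For the sample row this should give the same three rows as the UDF version. Other top-level attributes can be kept by adding them next to `col("id")` in the first `select`. One difference to be aware of: a country whose struct is null for a given row still produces a row, with null `currency` and `price`, so a filter on `price` may be needed.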