I need help using Apache Spark / Scala to convert a flat dataset into a nested format.
Is it possible to automatically create a nested structure derived from the input column namespaces,
i.e. [level 1].[level 2]? In my example, the nesting level is determined by the period character '.' in the column headers.
I assume this can be achieved with a map function. I am open to other solutions, particularly if there is a more elegant way of achieving the same outcome.
package org.acme.au

import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object testNestedObject extends App {

  // Configure Spark
  val spark = SparkSession.builder()
    .appName("Spark batch demo")
    .master("local[*]")
    .config("spark.driver.host", "localhost")
    .getOrCreate()

  // Start Spark
  val sc = spark.sparkContext
  sc.setLogLevel("ERROR")

  // Define schema for the flat input data; nested levels are encoded
  // in the column names via the '.' separator
  val flatSchema = new StructType()
    .add(StructField("id", StringType, false))
    .add(StructField("name", StringType, false))
    .add(StructField("custom_fields.fav_colour", StringType, true))
    .add(StructField("custom_fields.star_sign", StringType, true))

  // Create rows with dummy data
  val row1 = Row("123456", "John Citizen", "Blue", "Scorpio")
  val row2 = Row("990087", "Jane Simth", "Green", "Taurus")
  val flatData = Seq(row1, row2)

  // Convert into a DataFrame
  val dfIn = spark.createDataFrame(sc.parallelize(flatData), flatSchema)

  // Print to console
  dfIn.printSchema()
  dfIn.show()

  // Convert flat data into nested structure as either Parquet or JSON format
  val dfOut = dfIn.rdd
    .map(
      row => ( /* TODO: Need help with mapping flat data to nested structure derived from input column namespaces
                *
                * For example:
                *
                * <id>123456</id>
                * <name>John Citizen</name>
                * <custom_fields>
                *   <fav_colour>Blue</fav_colour>
                *   <star_sign>Scorpio</star_sign>
                * </custom_fields>
                *
                */ ))

  // Stop Spark
  sc.stop()
}
Answer 0 (score: 1)
This can be solved with a dedicated case class and a UDF that converts the input data into case class instances. For example:

Define the case class NestedFields:

case class NestedFields(fav_colour: String, star_sign: String)

Define the UDF that takes the original column values as input and returns an instance of NestedFields (udf comes from org.apache.spark.sql.functions, and the $-column syntax requires import spark.implicits._):

import org.apache.spark.sql.functions.udf
import spark.implicits._

private val asNestedFields = udf((fc: String, ss: String) => NestedFields(fc, ss))

Transform the original DataFrame and drop the flat columns:

val res = dfIn.withColumn("custom_fields", asNestedFields($"`custom_fields.fav_colour`", $"`custom_fields.star_sign`"))
  .drop($"`custom_fields.fav_colour`")
  .drop($"`custom_fields.star_sign`")

It produces:

root
 |-- id: string (nullable = false)
 |-- name: string (nullable = false)
 |-- custom_fields: struct (nullable = true)
 |    |-- fav_colour: string (nullable = true)
 |    |-- star_sign: string (nullable = true)

+------+------------+---------------+
|    id|        name|  custom_fields|
+------+------------+---------------+
|123456|John Citizen|[Blue, Scorpio]|
|990087|  Jane Simth|[Green, Taurus]|
+------+------------+---------------+
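For completeness, a short sketch of persisting the nested result, since the question asks for Parquet or JSON output; the commented JSON line is illustrative and the output paths are placeholders:

res.toJSON.show(false)
// e.g. {"id":"123456","name":"John Citizen","custom_fields":{"fav_colour":"Blue","star_sign":"Scorpio"}}

// Placeholder paths -- adjust to your environment
res.write.mode("overwrite").json("/tmp/nested_demo_json")
res.write.mode("overwrite").parquet("/tmp/nested_demo_parquet")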
Answer 1 (score: 1)
Here's a generalized solution that first assembles a Map of the column names containing '.', traverses the Map to add converted struct columns to the DataFrame, and finally drops the original flat columns via drop. A slightly more generalized dfIn is used as the sample data.

import org.apache.spark.sql.functions._

val dfIn = Seq(
  (123456, "John Citizen", "Blue", "Scorpio", "a", 1),
  (990087, "Jane Simth", "Green", "Taurus", "b", 2)
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign", "s.c1", "s.c2")

val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
//   Array(custom_fields.fav_colour, custom_fields.star_sign, s.c1, s.c2)

val structColsMap = structCols.map(_.split("\\.")).
  groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
//   Map(s -> Array(c1, c2), custom_fields -> Array(fav_colour, star_sign))

val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "." + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}

val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)

dfResult.show
// +------+------------+-----+--------------+
// |id    |name        |s    |custom_fields |
// +------+------------+-----+--------------+
// |123456|John Citizen|[a,1]|[Blue,Scorpio]|
// |990087|Jane Simth  |[b,2]|[Green,Taurus]|
// +------+------------+-----+--------------+

dfResult.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- s: struct (nullable = false)
//  |    |-- c1: string (nullable = true)
//  |    |-- c2: integer (nullable = false)
//  |-- custom_fields: struct (nullable = false)
//  |    |-- fav_colour: string (nullable = true)
//  |    |-- star_sign: string (nullable = true)

Note that this solution handles only up to one nested level.

To convert each row into JSON format, consider using toJSON as shown below:
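A minimal sketch; the commented JSON lines are illustrative of what toJSON yields for this data:

dfResult.toJSON.show(false)
// e.g.
// {"id":123456,"name":"John Citizen","s":{"c1":"a","c2":1},"custom_fields":{"fav_colour":"Blue","star_sign":"Scorpio"}}
// {"id":990087,"name":"Jane Simth","s":{"c1":"b","c2":2},"custom_fields":{"fav_colour":"Green","star_sign":"Taurus"}}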
Answer 2 (score: 1)
This solution is for the revised requirement that the JSON output consist of an array of {K: valueK, V: valueV} rather than a single {valueK1: valueV1, valueK2: valueV2, ...} object. For example:
// FROM:
"custom_fields":{"fav_colour":"Blue", "star_sign":"Scorpio"}
// TO:
"custom_fields":[{"key":"fav_colour", "value":"Blue"}, {"key":"star_sign", "value":"Scorpio"}]
Sample code below:
import org.apache.spark.sql.functions._
val dfIn = Seq(
  (123456, "John Citizen", "Blue", "Scorpio"),
  (990087, "Jane Simth", "Green", "Taurus")
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.star_sign")
val structCols = dfIn.columns.filter(_.contains("."))
// structCols: Array[String] =
// Array(custom_fields.fav_colour, custom_fields.star_sign)
val structColsMap = structCols.map(_.split("\\.")).
  groupBy(_(0)).mapValues(_.map(_(1)))
// structColsMap: scala.collection.immutable.Map[String,Array[String]] =
//   Map(custom_fields -> Array(fav_colour, star_sign))

val dfExpanded = structColsMap.foldLeft(dfIn){ (accDF, kv) =>
  val cols = kv._2.map( v =>
    struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").as("value"))
  )
  accDF.withColumn(kv._1, array(cols: _*))
}
val dfResult = structCols.foldLeft(dfExpanded)(_ drop _)
dfResult.show(false)
// +------+------------+----------------------------------------+
// |id |name |custom_fields |
// +------+------------+----------------------------------------+
// |123456|John Citizen|[[fav_colour,Blue], [star_sign,Scorpio]]|
// |990087|Jane Simth |[[fav_colour,Green], [star_sign,Taurus]]|
// +------+------------+----------------------------------------+
dfResult.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- custom_fields: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- key: string (nullable = false)
//  |    |    |-- value: string (nullable = true)
dfResult.toJSON.show(false)
// +-------------------------------------------------------------------------------------------------------------------------------+
// |value |
// +-------------------------------------------------------------------------------------------------------------------------------+
// |{"id":123456,"name":"John Citizen","custom_fields":[{"key":"fav_colour","value":"Blue"},{"key":"star_sign","value":"Scorpio"}]}|
// |{"id":990087,"name":"Jane Simth","custom_fields":[{"key":"fav_colour","value":"Green"},{"key":"star_sign","value":"Taurus"}]} |
// +-------------------------------------------------------------------------------------------------------------------------------+
Note that since the Spark DataFrame API doesn't support the type Any, we can't make value of type Any to accommodate a mix of different types. As a result, the value in the arrays must be of a single given type (e.g. String). Like the previous solution, this one also handles only up to one nested level.
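If the flat columns do hold mixed types, a possible workaround is to cast every value to String before assembling the key/value structs. Below is a minimal sketch under that assumption; the dfMixed sample data and its custom_fields.rating column are hypothetical:

import org.apache.spark.sql.functions._

// Hypothetical sample data with a mixed-type custom field (Int rating)
val dfMixed = Seq(
  (123456, "John Citizen", "Blue", 9),
  (990087, "Jane Simth", "Green", 5)
).toDF("id", "name", "custom_fields.fav_colour", "custom_fields.rating")

val mixedCols = dfMixed.columns.filter(_.contains("."))
val mixedColsMap = mixedCols.map(_.split("\\.")).
  groupBy(_(0)).mapValues(_.map(_(1)))

val dfMixedExpanded = mixedColsMap.foldLeft(dfMixed){ (accDF, kv) =>
  val cols = kv._2.map( v =>
    // cast("string") coerces each value to one common type so the array stays valid
    struct(lit(v).as("key"), col("`" + kv._1 + "." + v + "`").cast("string").as("value"))
  )
  accDF.withColumn(kv._1, array(cols: _*))
}

mixedCols.foldLeft(dfMixedExpanded)(_ drop _).show(false)
// custom_fields now holds both the colour and the rating, rendered as strings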