I'd like to know an efficient way to do this. Suppose we have JSON data with the following schema:
root
|-- fields: struct (nullable = true)
| |-- custid: string (nullable = true)
| |-- password: string (nullable = true)
| |-- role: string (nullable = true)
I can read it into a DataFrame and pull up the nested fields with
jsonData_1.withColumn("custid", col("fields.custid")).withColumn("password", col("fields.password")).withColumn("role", col("fields.role"))
However, if we have 100 nested columns, or the columns tend to change over time or gain further nesting, this approach does not feel like a good choice. Is there a way to have the code automatically discover all columns and sub-columns by reading the input JSON file and build the DataFrame from them? Or is this the only good way? Please share your thoughts.
Answer 0 (score: 2)
There is no need to specify each and every column from the struct type in Spark. We can extract all of a struct's keys by selecting struct_field.* in a .select.
Example:
import spark.implicits._ // required for .toDS outside the spark-shell

//read the json data into a DataFrame
val df = spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr"}}""").toDS)

df.printSchema
//root
// |-- fields: struct (nullable = true)
// |    |-- custid: string (nullable = true)
// |    |-- password: string (nullable = true)
// |    |-- role: string (nullable = true)

//get all field values extracted from the fields struct
df.select("fields.*").show()
//+------+--------+----+
//|custid|password|role|
//+------+--------+----+
//| 1| foo| rr|
//+------+--------+----+
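If you prefer building the projection from SQL expression strings, the same expansion is available through selectExpr. A minimal sketch (not part of the original answer); note that fields.* only expands a single level, so any deeper struct remains a single struct-typed column:

//equivalent SQL-expression form of the select above
df.selectExpr("fields.*").show()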
For more deeply nested JSON, flatten the schema dynamically:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col

//recursively walk the schema and collect a column reference for every leaf field
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _              => Array(col(colName))
    }
  })
}
//a sample with one more level of nesting
val df = spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr","nested-2":{"id":"1"}}}""").toDS)
df.select(flattenSchema(df.schema): _*).show()
//+------+---+--------+----+
//|custid| id|password|role|
//+------+---+--------+----+
//| 1| 1| foo| rr|
//+------+---+--------+----+
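One caveat worth noting: flattenSchema keeps only the leaf names, so two structs that share a field name would collide after flattening. A sketch of a variant, assuming the same df as above (flattenSchemaUnique is a hypothetical helper, not from the original answer), that aliases each leaf with its full path, dots replaced by underscores:

import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col

//same recursion as flattenSchema, but alias each leaf column with its
//full dotted path (dots replaced by underscores) so names stay unique
def flattenSchemaUnique(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchemaUnique(st, colName)
      case _              => Array(col(colName).as(colName.replace(".", "_")))
    }
  })
}

df.select(flattenSchemaUnique(df.schema): _*).show()
//columns are now fields_custid, fields_nested-2_id, fields_password, fields_role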