Reading JSON object data as MapType in Spark

Time: 2018-03-30 09:27:03

Tags: scala apache-spark dataframe apache-spark-sql

I wrote a sample Spark application in which I create a DataFrame with a MapType column and write it to disk. Then I read the same file back and print its schema. The schema of the output file differs from the input schema, and I no longer see the MapType in the output. How can I read that output file back with the MapType preserved?

Code

import org.apache.spark.sql.{SaveMode, SparkSession}

case class Department(Id:String,Description:String)
case class Person(name:String,department:Map[String,Department])

object sample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Custom Poc").getOrCreate
    import spark.implicits._

    val schemaData = Seq(
      Person("Persion1", Map("It" -> Department("1", "It Department"), "HR" -> Department("2", "HR Department"))),
      Person("Persion2", Map("It" -> Department("1", "It Department")))
    )
    val df = spark.sparkContext.parallelize(schemaData).toDF()
    println("Input schema")
    df.printSchema()
    df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

    println("Output schema")
    spark.read.json("D:\\save\\output\\*.json").printSchema()
  }
}

Output

Input schema
root
 |-- name: string (nullable = true)
 |-- department: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Description: string (nullable = true)
Output schema
root
 |-- department: struct (nullable = true)
 |    |-- HR: struct (nullable = true)
 |    |    |-- Description: string (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |    |-- It: struct (nullable = true)
 |    |    |-- Description: string (nullable = true)
 |    |    |-- Id: string (nullable = true)
 |-- name: string (nullable = true)

JSON file

{"name":"Persion1","department":{"It":{"Id":"1","Description":"It Department"},"HR":{"Id":"2","Description":"HR Department"}}}
{"name":"Persion2","department":{"It":{"Id":"1","Description":"It Department"}}}

Edit: To clarify my requirement, I added the file-saving part above. In the real scenario, I will only be reading the JSON data shown above and working with that DataFrame.
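If the original DataFrame is not available (as in the edited requirement above), the schema can be built by hand with `StructType` and `MapType` and passed to the reader. This is a minimal sketch; the object name `ReadWithMapType` and the path are illustrative, and the path should point at wherever the JSON lines shown above are stored:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ReadWithMapType {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Read MapType").getOrCreate()

    // Schema of the map's value: the Department struct
    val departmentType = StructType(Seq(
      StructField("Id", StringType, nullable = true),
      StructField("Description", StringType, nullable = true)
    ))

    // Top-level schema: name plus a map from department name to Department
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("department", MapType(StringType, departmentType), nullable = true)
    ))

    // With an explicit schema, Spark keeps the map instead of inferring a struct
    spark.read.schema(schema).json("D:\\save\\output").printSchema()
  }
}
```

Without an explicit schema, `spark.read.json` infers one from the data, and JSON objects are always inferred as structs whose fields are the observed keys, which is why the MapType disappears.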

1 answer:

Answer 0 (score: 3)

You can pass the schema of the original DataFrame while reading the JSON data back:

println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

println("Output schema")
spark.read.schema(df.schema).json("D:\\save\\output").printSchema()

Input schema

root
 |-- name: string (nullable = true)
 |-- department: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Description: string (nullable = true)

Output schema

root
 |-- name: string (nullable = true)
 |-- department: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- Id: string (nullable = true)
 |    |    |-- Description: string (nullable = true)
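When the writing side's `df` is not in scope but the case classes are, the same schema can be derived from them with `Encoders.product`, avoiding a hand-written `StructType`. A sketch under that assumption (the object name `ReadViaEncoder` is illustrative):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

case class Department(Id: String, Description: String)
case class Person(name: String, department: Map[String, Department])

object ReadViaEncoder {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Encoder schema").getOrCreate()

    // Derive the schema from the case classes instead of an existing DataFrame;
    // Map[String, Department] becomes a MapType(StringType, struct) column
    val schema = Encoders.product[Person].schema

    spark.read.schema(schema).json("D:\\save\\output").printSchema()
  }
}
```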

Hope this helps!