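Reading JSON in Scala and extracting only the required columns

Time: 2018-03-30 06:18:40

Tags: json scala apache-spark dataframe dataset

I am reading a multi-line JSON file that contains more than 60 fields, but I only need 30 of them as columns. How can I get just the required columns from the DataFrame?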

scala> peopleDF.printSchema
root
 |-- Applications: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b_als_o_isehp: boolean (nullable = true)
 |    |    |-- b_als_p_isehp: boolean (nullable = true)
 |    |    |-- b_als_s_isehp: boolean (nullable = true)
 |    |    |-- l_als_o_eventid: long (nullable = true)
 |    |    |-- l_als_o_pid: long (nullable = true)
 |    |    |-- l_als_o_sid: long (nullable = true)

How can I get only the required columns (for example l_als_o_pid, l_als_o_eventid, b_als_o_isehp)? This is what I have tried so far:

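import org.apache.spark.sql.functions.explode

val peopleDF = spark.read.json("file:///root/users/inputjsondata/s_json2.json")
val ss = peopleDF.select("Applications")
ss.createOrReplaceTempView("result2")
// Keep the DataFrame in `child`; show() returns Unit, so call it separately
val child = ss.select(explode(peopleDF("Applications.t_als_s_path"))).toDF("app")
child.show()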

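1 answer:

Answer 0 (score: 2)

You can explode the top-level array field first, so that each array element becomes its own row, and then select the inner struct fields: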

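import org.apache.spark.sql.functions.explode
import spark.implicits._  // enables the $"..." column syntax (already in scope in spark-shell)
val peopleDF = spark.read.json("file:///root/users/inputjsondata/s_json2.json")
// One row per element of the Applications array, then flatten the struct into top-level columns
val newDF = peopleDF.select(explode($"Applications").as("app"))
  .select("app.*")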

You can now select fields such as l_als_o_pid, l_als_o_eventid, and b_als_o_isehp directly from newDF. Hope this helps!
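
As a rough sketch of that last step, the required field names could be kept in a list and selected in one pass, which may be handier than flattening all 60+ fields when only 30 are needed. The inline sample record, the local SparkSession, and the names `sample`, `required`, and `result` are assumptions made for illustration; they are not from the original post, and in practice `peopleDF` would come from the `spark.read.json(...)` call on the real file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session for a standalone run; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("select-required-columns")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample record that mimics part of the schema shown in the question.
val sample = Seq(
  """{"Applications":[{"b_als_o_isehp":true,"b_als_p_isehp":false,"l_als_o_eventid":10,"l_als_o_pid":1,"l_als_o_sid":7},{"b_als_o_isehp":false,"b_als_p_isehp":true,"l_als_o_eventid":20,"l_als_o_pid":2,"l_als_o_sid":8}]}"""
).toDS()
val peopleDF = spark.read.json(sample)

// The columns to keep; extend this list to the 30 fields actually needed.
val required = Seq("l_als_o_pid", "l_als_o_eventid", "b_als_o_isehp")

// Explode the array into one row per element, then pick only the listed struct fields.
val result = peopleDF
  .select(explode(col("Applications")).as("app"))
  .select(required.map(name => col(s"app.$name")): _*)

result.printSchema()
result.show()
```

Selecting `col("app.<field>")` yields a column named after the inner field, so `result` ends up with exactly the names in `required`, in that order.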