Spark Scala reads JSON as a single column

Asked: 2017-11-08 13:28:22

Tags: json scala apache-spark

OS X El Capitan 10.11.6
Spark 2.2.0
Scala 2.11.8

The specific JSON file I am trying to read in Spark can be found here.

When I use the code below, the output is a single column that appears to contain all of the column values:

val test = spark.read.format("json").load("Downloads/yql.json")
test.show()

+--------------------+
|               query|
+--------------------+
|[1,2017-11-08T12:...|
+--------------------+

However, running test.printSchema() returns a properly nested schema for the JSON.

How can I read this file into Spark so that the JSON is converted into a DataFrame with multiple columns?

2 Answers:

Answer 0 (score: 0)

Get the schema of the JSON data using:

val dataDF = spark.read.option("samplingRatio", "1.0").json("test.json")

Apply that schema when reading the JSON file:

val test = spark.read.schema(dataDF.schema).json("test.json")

Now you can query specific columns of the JSON:

test.select($"query.lang").show
+-----+
| lang|
+-----+
|en-us|
+-----+
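Since query is a single struct column, its fields can also be pulled up one level with a star expansion. Below is a minimal, self-contained sketch of that technique; the inline JSON record is a hypothetical stand-in for the top of the yql payload, not the actual file:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[1]")
  .appName("flatten-json")
  .getOrCreate()
import spark.implicits._

// Inline stand-in for the yql file: one record with a top-level `query` struct.
val test = spark.read.json(Seq(
  """{"query": {"count": 1, "created": "2017-11-08T12:00:00Z", "lang": "en-us"}}"""
).toDS)

// `query.*` expands every field of the struct into its own column,
// turning the one-column DataFrame into count / created / lang.
val flattened = test.select($"query.*")
flattened.show()
```

This only flattens one level of nesting; arrays deeper in the structure still need explode, as in the accepted approach below.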

Answer 1 (score: 0)

My problem was not understanding how to extract fields from nested Array and Struct types. Of course, after posting this question I came across this post, which helped me greatly.

Here is an example of how I got the 10-day forecast into a DataFrame from the data source included in the question:

import org.apache.spark.sql.functions.explode

val test = spark.read.format("json").load("Downloads/yql.json")

val testFinal = test.
    select(
        "query.created",
        "query.results.channel.item.lat",
        "query.results.channel.item.long",
        "query.results.channel.units.temperature",
        "query.results.channel.item.forecast").
    // explode produces one output row per element of the forecast array
    withColumn("forecast_explode", explode($"forecast")).
    withColumn("date", $"forecast_explode.date").
    withColumn("forecast_high", $"forecast_explode.high").
    withColumn("forecast_low", $"forecast_explode.low").
    // drop the original array and the intermediate struct column
    drop($"forecast").
    drop($"forecast_explode")

testFinal.show() 

+--------------------+--------+----------+-----------+-----------+-------------+------------+
|             created|     lat|      long|temperature|       date|forecast_high|forecast_low|
+--------------------+--------+----------+-----------+-----------+-------------+------------+
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|08 Nov 2017|           48|          39|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|09 Nov 2017|           52|          39|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|10 Nov 2017|           47|          27|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|11 Nov 2017|           40|          25|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|12 Nov 2017|           48|          33|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|13 Nov 2017|           51|          45|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|14 Nov 2017|           53|          43|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|15 Nov 2017|           52|          39|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|16 Nov 2017|           54|          43|
|2017-11-08T14:57:15Z|40.71455|-74.007118|          F|17 Nov 2017|           54|          44|
+--------------------+--------+----------+-----------+-----------+-------------+------------+
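The explode-then-select pattern above can be sketched in a self-contained way. The inline JSON below is a hypothetical stand-in for the nested forecast array in the yql payload, trimmed to two days:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder
  .master("local[1]")
  .appName("explode-forecast")
  .getOrCreate()
import spark.implicits._

// Inline stand-in for the nested forecast array: one record whose
// `item.forecast` field is an array of structs.
val df = spark.read.json(Seq(
  """{"item": {"forecast": [
       {"date": "08 Nov 2017", "high": "48", "low": "39"},
       {"date": "09 Nov 2017", "high": "52", "low": "39"}]}}"""
).toDS)

// explode turns each array element into its own row; the struct fields
// of each element are then reachable with dot notation.
val perDay = df
  .withColumn("f", explode($"item.forecast"))
  .select($"f.date", $"f.high", $"f.low")
perDay.show()
```

One row comes out per forecast day, which is exactly the shape shown in the table above.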