Unifying differently-shaped JSON with Apache Spark

Date: 2018-06-06 08:22:27

Tags: apache-spark apache-spark-sql databricks

Example:

Here is a sample of the JSON data, where each record carries a different set of attributes:

{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}
{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}
{"id": 3, "label": "sand", "weight": "25kg"}

Question:

Is it possible in Apache Spark to transform this JSON into a structured Dataset like the one below?

+--+-----+------+--------+-----+------+
|id|label|length|diameter|width|weight|
+--+-----+------+--------+-----+------+
|1 |tube |50m   |5cm     |     |      |
|2 |brick|25cm  |        |10cm |      |
|3 |sand |      |        |     |25kg  |
+--+-----+------+--------+-----+------+

1 Answer:

Answer 0 (score: 2)

No problem. Just read it and let Spark infer the schema:

// toDS requires the implicit encoders provided by the active SparkSession
import spark.implicits._

val ds = Seq(
  """{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}""",
  """{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}""",
  """{"id": 3, "label": "sand", "weight": "25kg"}"""
).toDS

spark.read.json(ds).show
// +--------+---+-----+------+------+-----+
// |diameter| id|label|length|weight|width|
// +--------+---+-----+------+------+-----+
// |     5cm|  1| tube|   50m|  null| null|
// |    null|  2|brick|  25cm|  null| 10cm|
// |    null|  3| sand|  null|  25kg| null|
// +--------+---+-----+------+------+-----+
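Note that schema inference orders the columns alphabetically. To get the column order shown in the question, a `select` after reading should suffice (a minimal sketch; column names are taken from the inferred schema above):

```scala
// Reorder the inferred columns to match the layout asked for in the question.
spark.read.json(ds)
  .select("id", "label", "length", "diameter", "width", "weight")
  .show
```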

Or provide the expected schema when reading (note that this particular schema omits `diameter`, so that field is dropped from the result):

import org.apache.spark.sql.types._

val fields = Seq("label", "length", "weight", "width")

val schema = StructType(
  StructField("id", LongType) +: fields.map {
    StructField(_, StringType)
  }
)

spark.read.schema(schema).json(ds).show
// +---+-----+------+------+-----+
// | id|label|length|weight|width|
// +---+-----+------+------+-----+
// |  1| tube|   50m|  null| null|
// |  2|brick|  25cm|  null| 10cm|
// |  3| sand|  null|  25kg| null|
// +---+-----+------+------+-----+
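The same pattern applies when the records live in newline-delimited JSON files in storage rather than an in-memory Dataset; the path below is a placeholder assumption, not from the original question:

```scala
// Read newline-delimited JSON from a file with the explicit schema.
// "/path/to/data.jsonl" is a hypothetical location.
val df = spark.read.schema(schema).json("/path/to/data.jsonl")
df.show
```

Supplying the schema up front also avoids the extra pass over the data that inference requires.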