Example:
Below is a sample of JSON data in which the records have different attributes:
{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}
{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}
{"id": 3, "label": "sand", "weight": "25kg"}
Question:
Is it possible in Apache Spark to convert this JSON into a structured Dataset like the following?
+--+-----+------+--------+-----+------+
|id|label|length|diameter|width|weight|
+--+-----+------+--------+-----+------+
|1 |tube |50m   |5cm     |     |      |
|2 |brick|25cm  |        |10cm |      |
|3 |sand |      |        |     |25kg  |
+--+-----+------+--------+-----+------+
Answer 0 (score: 2)
No problem. Just read it and let Spark infer the schema:
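import spark.implicits._  // needed for .toDS; already in scope in spark-shell

// One JSON document per element, each with a different set of keys
val ds = Seq(
  """{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}""",
  """{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}""",
  """{"id": 3, "label": "sand", "weight": "25kg"}"""
).toDS

// The inferred schema is the union of all keys; missing values become null
spark.read.json(ds).show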
// +--------+---+-----+------+------+-----+
// |diameter| id|label|length|weight|width|
// +--------+---+-----+------+------+-----+
// |     5cm|  1| tube|   50m|  null| null|
// |    null|  2|brick|  25cm|  null| 10cm|
// |    null|  3| sand|  null|  25kg| null|
// +--------+---+-----+------+------+-----+
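As the output above shows, the inferred columns come back in alphabetical order rather than in the order used in the question. A minimal sketch of one way to reorder them with `select` follows; the `wanted` list, the local `SparkSession` settings, and the app name are assumptions made for illustration and are not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("json-columns").getOrCreate()
import spark.implicits._

val ds = Seq(
  """{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}""",
  """{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}""",
  """{"id": 3, "label": "sand", "weight": "25kg"}"""
).toDS

// Hypothetical column order, copied from the table in the question
val wanted = Seq("id", "label", "length", "diameter", "width", "weight")

// Infer the schema, then select the columns in the wanted order
val ordered = spark.read.json(ds).select(wanted.map(col): _*)
ordered.show
```

Selecting by name fails if a listed column is missing from the inferred schema, so this only suits data where every wanted key appears in at least one record.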
Or provide the expected schema when reading:
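import org.apache.spark.sql.types._

// String-typed columns; "diameter" is not listed, so it is dropped from the result
val fields = Seq("label", "length", "weight", "width")

// id is read as a long, every other field as a string
val schema = StructType(
  StructField("id", LongType) +: fields.map {
    StructField(_, StringType)
  }
)

spark.read.schema(schema).json(ds).show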
// +---+-----+------+------+-----+
// | id|label|length|weight|width|
// +---+-----+------+------+-----+
// |  1| tube|   50m|  null| null|
// |  2|brick|  25cm|  null| 10cm|
// |  3| sand|  null|  25kg| null|
// +---+-----+------+------+-----+
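The table in the question shows blank cells where the results above show null. If blanks are wanted, one option is to fill the string columns with empty strings via `na.fill("")`. The sketch below is an illustration, not part of the original answer: it assumes a local `SparkSession`, and it lists all six fields in the schema (including `diameter`) so that the result matches the question's layout:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Assumption: a local session for a standalone run; spark-shell already provides `spark`
val spark = SparkSession.builder().master("local[*]").appName("json-blanks").getOrCreate()
import spark.implicits._

val ds = Seq(
  """{"id": 1, "label": "tube", "length": "50m", "diameter": "5cm"}""",
  """{"id": 2, "label": "brick", "width": "10cm", "length": "25cm"}""",
  """{"id": 3, "label": "sand", "weight": "25kg"}"""
).toDS

// Assumption: string fields listed in the column order used in the question
val fields = Seq("label", "length", "diameter", "width", "weight")
val schema = StructType(
  StructField("id", LongType) +: fields.map(StructField(_, StringType))
)

// na.fill("") replaces nulls in string columns only; the long id column is untouched
val filled = spark.read.schema(schema).json(ds).na.fill("")
filled.show
```

Whether to keep nulls or use empty strings depends on what consumes the data: nulls keep "missing" distinct from "empty", while blanks are mostly a display convenience.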