Here is my nested JSON file.
{
  "dc_id": "dc-101",
  "source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp": 35,
      "c02_level": 1475,
      "geo": {"lat": 38.00, "long": 97.00}
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat": 47.41, "long": -122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat": 33.61, "long": -111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat": 35.93, "long": -85.46}
    }
  }
}
How do I read this JSON file into a DataFrame using Spark with Scala? There are no array objects in the JSON file, so I can't use explode. Can anyone help?
Answer 0 (score: 2)
import org.apache.spark.sql.functions.{array, col, explode}

// Read the multiline JSON file into a DataFrame
val df = spark.read.option("multiline", true).json("data/test.json")

// Turn each struct under "source" into its own row, then pull the nested fields up into columns
df
  .select(col("dc_id"), explode(array("source.*")) as "level1")
  .withColumn("id", col("level1.id"))
  .withColumn("ip", col("level1.ip"))
  .withColumn("temp", col("level1.temp"))
  .withColumn("description", col("level1.description"))
  .withColumn("c02_level", col("level1.c02_level"))
  .withColumn("lat", col("level1.geo.lat"))
  .withColumn("long", col("level1.geo.long"))
  .drop("level1")
  .show(false)
Sample output:
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc_id |id |ip |temp|description |c02_level|lat |long |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
|dc-101|10 |68.28.91.22 |35 |Sensor attached to the container ceilings |1475 |38.0 |97.0 |
|dc-101|8 |208.109.163.218|40 |Sensor attached to the factory ceilings |1346 |33.61|-111.89|
|dc-101|13 |67.185.72.1 |34 |Sensor ipad attached to carbon cylinders |1370 |47.41|-122.0 |
|dc-101|5 |204.116.105.67 |40 |Sensor embedded in exhaust pipes in the ceilings|1574 |35.93|-85.46 |
+------+---+---------------+----+------------------------------------------------+---------+-----+-------+
Instead of selecting each column one by one, you could also write a generic UDF or helper that pulls out all the individual columns.
Note: tested with Spark 2.3.
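As a rough sketch of that idea (a hypothetical helper, not part of the original answer; it uses the struct's schema rather than a UDF, and only flattens one level):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: promote every field of a struct column to a top-level column
// (flattens one level; nested structs such as "geo" would need another pass).
def flattenStruct(df: DataFrame, structCol: String): DataFrame = {
  val fieldNames = df.select(col(s"$structCol.*")).columns
  val flattened  = fieldNames.map(f => col(s"$structCol.$f").as(f))
  df.select((col("*") +: flattened): _*).drop(structCol)
}

// Usage against the exploded DataFrame built above:
// flattenStruct(df.select(col("dc_id"), explode(array("source.*")) as "level1"), "level1").show(false)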
Answer 1 (score: 1)
Put the JSON string into a variable named jsonString.
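For instance, one way to populate jsonString is to load the file contents from disk (a sketch; it assumes the JSON above is saved as data/test.json):

// Hypothetical: read the whole JSON file from disk into a single string
val jsonString = scala.io.Source.fromFile("data/test.json").mkString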
import org.apache.spark.sql._
import org.apache.spark.sql.functions.{array, explode}
import spark.implicits._

// Parse the JSON string, then pull the nested lat value out into its own column
val df = spark.read.json(Seq(jsonString).toDS)
val df1 = df.withColumn("lat", explode(array("source.sensor-igauge.geo.lat")))
You can follow the same steps for other structures as well, such as map or array structures.
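For illustration only (arrDf and mapDf are hypothetical DataFrames with schemas different from the question's, shown just to sketch how explode behaves on those types):

import org.apache.spark.sql.functions.{col, explode}

// Hypothetical arrDf: "source" is an array of sensor structs, so explode yields one row per element
val sensorRows = arrDf.select(col("dc_id"), explode(col("source")) as "sensor")

// Hypothetical mapDf: "source" is a map of sensor name -> struct; explode on a map yields "key" and "value" columns
val sensorsByName = mapDf.select(col("dc_id"), explode(col("source"))).select("dc_id", "key", "value.ip")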
Answer 2 (score: 0)
import org.apache.spark.sql.functions.{array, explode}
import spark.implicits._

// Read the multiline JSON, explode the structs under "source", and select the nested fields directly
val df = spark.read.option("multiline", true).json("myfile.json")
df.select($"dc_id", explode(array("source.*")))
  .select($"dc_id", $"col.c02_level", $"col.description", $"col.geo.lat", $"col.geo.long", $"col.id", $"col.ip", $"col.temp")
  .show(false)
Output:
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc_id |c02_level|description |lat |long |id |ip |temp|
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+
|dc-101|1475 |Sensor attached to the container ceilings |38.0 |97.0 |10 |68.28.91.22 |35 |
|dc-101|1346 |Sensor attached to the factory ceilings |33.61|-111.89|8 |208.109.163.218|40 |
|dc-101|1370 |Sensor ipad attached to carbon cylinders |47.41|-122.0 |13 |67.185.72.1 |34 |
|dc-101|1574 |Sensor embedded in exhaust pipes in the ceilings|35.93|-85.46 |5 |204.116.105.67 |40 |
+------+---------+------------------------------------------------+-----+-------+---+---------------+----+