I have a Parquet file whose schema contains nested entries, like this:
{"id" : "1201", "name" : "satish", "age" : "25", "path":[{"x":1,"y":1},{"x":2,"y":2}]}
{"id" : "1202", "name" : "krishna", "age" : "28", "path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]}
Using Spark/Scala, how can I produce the following output, "flattening" each record according to its path entries:
{"id" : "1201", "name" : "satish", "age" : "25", "x": 1, "y":1}
{"id" : "1201", "name" : "satish", "age" : "25", "x": 2, "y":2}
{"id" : "1202", "name" : "krishna", "age" : "28", "x":1.23, "y":2.12}
{"id" : "1202", "name" : "krishna", "age" : "28", "x":1.23, "y":2.12}
Like:
+---+----+------+-+-+
|age|id  |name  |x|y|
+---+----+------+-+-+
|25 |1201|satish|1|1|
|25 |1201|satish|2|2|
+---+----+------+-+-+
Answer 0 (score: 1)
You can use explode and then select to get the result.

Example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession
  .builder()
  .master("local")
  .appName("ParquetAppendMode")
  .getOrCreate()

import spark.implicits._

val data1 = spark.read.json("explode.json")

// explode turns each element of the path array into its own row
val result = data1.withColumn("path", explode($"path"))
result.select("id", "name", "age", "path.x", "path.y").show()
Output:
+----+-------+---+----+----+
| id| name|age| x| y|
+----+-------+---+----+----+
|1201| satish| 25| 1.0| 1.0|
|1201| satish| 25| 2.0| 2.0|
|1202|krishna| 28|1.23|2.12|
|1202|krishna| 28|1.23|2.12|
+----+-------+---+----+----+
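To see exactly which rows the explode-then-select pipeline above produces, it can be modelled in plain Scala without a Spark session. The case classes below are hypothetical stand-ins for the schema Spark infers from the JSON; this is a sketch of the semantics, not the Spark API itself:

```scala
case class Point(x: Double, y: Double)
case class Record(id: String, name: String, age: String, path: Seq[Point])

// explode("path") emits one row per array element; the subsequent
// select("id", "name", "age", "path.x", "path.y") pulls the struct
// fields of each element out into top-level columns.
def explodeAndSelect(rows: Seq[Record]): Seq[(String, String, String, Double, Double)] =
  rows.flatMap(r => r.path.map(p => (r.id, r.name, r.age, p.x, p.y)))
```

Each input record with an n-element path contributes n output rows, which matches the four-row table shown above.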
Answer 1 (score: 0)
When you read the parquet file, you just need to use the explode function on the path column of the dataframe.

Reading from the parquet file gives a dataframe like:

+---+----+------+--------------+
|age|id  |name  |path          |
+---+----+------+--------------+
|25 |1201|satish|[[1,1], [2,2]]|
+---+----+------+--------------+

Combine the withColumn and explode functions as

dataframe.withColumn("path", explode($"path")).show(false)

and you will get the following output:

+---+----+------+-----+
|age|id  |name  |path |
+---+----+------+-----+
|25 |1201|satish|[1,1]|
|25 |1201|satish|[2,2]|
+---+----+------+-----+

If you still want to split the path column into two separate columns, try

val newdf = dataframe.withColumn("path", explode($"path"))
newdf.withColumn("x", newdf("path.x"))
  .withColumn("y", newdf("path.y"))
  .drop("path").show(false)

Or you can use a select query:

newdf.select("age", "id", "name", "path.x", "path.y").show(false)

Your final result should be:

+---+----+------+---+---+
|age|id  |name  |x  |y  |
+---+----+------+---+---+
|25 |1201|satish|1  |1  |
|25 |1201|satish|2  |2  |
+---+----+------+---+---+

I guess this is what you are looking for.
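Answer 1's two-step version (explode first, then split the struct into x and y columns) can likewise be sketched in plain Scala. The Point class and the tuple row shape are assumptions for illustration only; they model the dataframe semantics rather than call the Spark API:

```scala
case class Point(x: Double, y: Double)

// Step 1: mimic withColumn("path", explode($"path")) —
// one output row per element of the path array.
def explodePath(rows: Seq[(String, String, String, Seq[Point])])
    : Seq[(String, String, String, Point)] =
  rows.flatMap { case (id, name, age, path) =>
    path.map(p => (id, name, age, p))
  }

// Step 2: mimic withColumn("x", ...).withColumn("y", ...).drop("path") —
// replace the struct with its two scalar fields.
def splitPath(rows: Seq[(String, String, String, Point)])
    : Seq[(String, String, String, Double, Double)] =
  rows.map { case (id, name, age, p) => (id, name, age, p.x, p.y) }
```

Running the two steps in sequence yields the same flattened rows as the single select("age", "id", "name", "path.x", "path.y") shortcut.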