How do I get flattened object/string output in Parquet/Spark/Scala?

Time: 2017-06-01 04:46:23

Tags: scala apache-spark

I have a Parquet object whose schema contains nested entries, like this:

{"id" : "1201", "name" : "satish", "age" : "25", "path":[{"x":1,"y":1},{"x":2,"y":2}]}
{"id" : "1202", "name" : "krishna", "age" : "28", "path":[{"x":1.23,"y":2.12},{"x":1.23,"y":2.12}]}

Using Spark/Scala, how can I produce the following output, with everything "flattened" so that each path entry becomes its own row:

{"id" : "1201", "name" : "satish", "age" : "25", "x": 1, "y":1}
{"id" : "1201", "name" : "satish", "age" : "25", "x": 2, "y":2}
{"id" : "1202", "name" : "krishna", "age" : "28", "x":1.23,"y":2.12}
{"id" : "1202", "name" : "krishna", "age" : "28", "x":1.23,"y":2.12}

As a table, something like:

+---+----+------+---+---+
|age|id  |name  |x  |y  |
+---+----+------+---+---+
|25 |1201|satish|1  |1  |
|25 |1201|satish|2  |2  |
+---+----+------+---+---+

2 answers:

Answer 0: (score: 1)

You can use explode followed by select to get this result.

Example:

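  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.explode

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("ParquetAppendMode")
    .getOrCreate()

  // Enables the $"column" syntax
  import spark.implicits._

  // The schema is inferred from the JSON; path becomes an array of structs
  val data1 = spark.read.json("explode.json")

  // explode emits one row per element of the path array
  val result = data1.withColumn("path", explode($"path"))

  // Pull the struct fields up into top-level columns
  result.select("id", "name", "age", "path.x", "path.y").show()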

Output:

+----+-------+---+----+----+
|  id|   name|age|   x|   y|
+----+-------+---+----+----+
|1201| satish| 25| 1.0| 1.0|
|1201| satish| 25| 2.0| 2.0|
|1202|krishna| 28|1.23|2.12|
|1202|krishna| 28|1.23|2.12|
+----+-------+---+----+----+
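
The question asks about Parquet, while the example above reads JSON. Here is a minimal sketch of the same explode-then-select approach against a Parquet file, assuming the schema shown in the question. The case classes `Point` and `Person`, the object name `FlattenParquetSketch`, and the temporary directory are hypothetical and exist only so the sketch is self-contained.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the schema in the question.
case class Point(x: Double, y: Double)
case class Person(id: String, name: String, age: String, path: Seq[Point])

object FlattenParquetSketch {
  // One output row per element of the path array, with x and y as top-level columns.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("path", explode(col("path")))
      .select("id", "name", "age", "path.x", "path.y")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("FlattenParquetSketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed location: a fresh temporary directory, used only for this sketch.
    val dir = Files.createTempDirectory("flatten-sketch").resolve("people.parquet").toString

    // Write the sample rows from the question as Parquet, then read them back.
    Seq(
      Person("1201", "satish", "25", Seq(Point(1.0, 1.0), Point(2.0, 2.0))),
      Person("1202", "krishna", "28", Seq(Point(1.23, 2.12), Point(1.23, 2.12)))
    ).toDS().write.parquet(dir)

    flatten(spark.read.parquet(dir)).show(false)

    spark.stop()
  }
}
```

With a real file, only the `spark.read.parquet(...)` argument should need to change, provided `path` is stored as an array of structs with fields `x` and `y`.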

Answer 1: (score: 0)

You just need to apply the explode function to the path column of the dataframe once you have read the parquet file.

When you read from the parquet file, the dataframe should look like this:

+---+----+------+--------------+
|age|id  |name  |path          |
+---+----+------+--------------+
|25 |1201|satish|[[1,1], [2,2]]|
+---+----+------+--------------+

Combine the explode function with the withColumn function:

  dataframe.withColumn("path", explode($"path")).show(false)

You will get the following output:

+---+----+------+-----+
|age|id  |name  |path |
+---+----+------+-----+
|25 |1201|satish|[1,1]|
|25 |1201|satish|[2,2]|
+---+----+------+-----+

If you also want to split the path column into two separate columns, try:

  val newdf = dataframe.withColumn("path", explode($"path"))
  newdf.withColumn("x", newdf("path.x"))
    .withColumn("y", newdf("path.y"))
    .drop("path").show(false)

Or you can use a select query:

  newdf.select("age", "id", "name", "path.x", "path.y").show(false)

Either way, the final result has the columns age, id, name, x and y, with one row per path entry.

I think this is what you are looking for.
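
One caveat that neither answer mentions: `explode` emits no row at all for a record whose `path` array is empty or null, so such records disappear from the flattened output. If they should be kept, `explode_outer` can be swapped in. It is available in Spark 2.2 and later, so it may be missing from older installations. The sketch below compares the two. The case classes `Coord` and `Record` and the third sample row with an empty `path` are hypothetical additions for illustration, not data from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, explode_outer}

// Hypothetical case classes mirroring the schema in the question.
case class Coord(x: Double, y: Double)
case class Record(id: String, name: String, age: String, path: Seq[Coord])

object ExplodeOuterSketch {
  // Records with an empty or null path produce no output rows.
  def flattenDroppingEmpty(df: DataFrame): DataFrame =
    df.withColumn("path", explode(col("path")))
      .select("id", "name", "age", "path.x", "path.y")

  // Records with an empty or null path produce one row with null x and y.
  def flattenKeepingEmpty(df: DataFrame): DataFrame =
    df.withColumn("path", explode_outer(col("path")))
      .select("id", "name", "age", "path.x", "path.y")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("ExplodeOuterSketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Record("1201", "satish", "25", Seq(Coord(1.0, 1.0), Coord(2.0, 2.0))),
      Record("1202", "krishna", "28", Seq(Coord(1.23, 2.12), Coord(1.23, 2.12))),
      // Hypothetical extra record with no path entries.
      Record("1203", "example", "30", Seq.empty)
    ).toDF()

    flattenDroppingEmpty(df).show(false) // id 1203 is absent
    flattenKeepingEmpty(df).show(false)  // id 1203 appears once, with null x and y

    spark.stop()
  }
}
```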