Apache Spark: write a JSON DataFrame partitioned by a nested column

Date: 2018-10-12 13:46:11

Tags: json apache-spark dataframe partition-by

I have JSON data of this form:

{
 "data": [
    {
      "id": "4619623",
      "team": "452144",
      "created_on": "2018-10-09 02:55:51",
      "links": {
        "edit": "https://some_page",
        "publish": "https://some_publish",
        "default": "https://some_default"
      }
    },
    {
      "id": "4619600",
      "team": "452144",
      "created_on": "2018-10-09 02:42:25",
      "links": {
        "edit": "https://some_page",
        "publish": "https://some_publish",
        "default": "https://some_default"
      }
    }
  ]
}

I read this data with Apache Spark and I want to write it out partitioned by the id column. When I use this:

df.write.partitionBy("data.id").json(<path_to_folder>)

I get the following error message:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Partition column data.id not found in schema

I also tried using the explode function:

import org.apache.spark.sql.functions.{col, explode}
val renamedDf = df.withColumn("id", explode(col("data.id")))
renamedDf.write.partitionBy("id").json(<path_to_folder>)

That did help, but each id partition folder then contained the same original JSON file.

EDIT: schema of the df DataFrame:

 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created_on: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- links: struct (nullable = true)
 |    |    |    |-- default: string (nullable = true)
 |    |    |    |-- edit: string (nullable = true)
 |    |    |    |-- publish: string (nullable = true)

Schema of the renamedDf DataFrame:

 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- created_on: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- links: struct (nullable = true)
 |    |    |    |-- default: string (nullable = true)
 |    |    |    |-- edit: string (nullable = true)
 |    |    |    |-- publish: string (nullable = true)
 |-- id: string (nullable = true)

I am using Spark 2.1.0.

I found the following solution: DataFrame partitionBy on nested columns

And this example: http://bigdatums.net/2016/02/12/how-to-extract-nested-json-data-in-spark/

But neither of them helped me solve my problem.

Thanks in advance for your help.

2 answers:

Answer 0 (score: 0)

Try the following code:

import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._ // for the $"..." column syntax

// Explode the array, then flatten the struct's fields to top level.
val renamedDf = df
  .select(explode(col("data")) as "x")
  .select($"x.*")
renamedDf.write.partitionBy("id").json(<path_to_folder>)
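Here explode(col("data")) produces one row per element of the data array, and select($"x.*") promotes the struct's fields (created_on, id, links, team) to top-level columns, which is why partitionBy("id") can now find the partition column in the schema.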

Answer 1 (score: 0)

You are just missing a select statement after your initial explode.
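A minimal sketch of what that could look like (the intermediate column name "x" is an assumption; the selected fields come from the schema above):

import org.apache.spark.sql.functions.{col, explode}

// Explode the data array into one struct per row, then select the
// struct's fields so they become top-level columns that
// partitionBy("id") can resolve.
val exploded = df.withColumn("x", explode(col("data")))
val flattened = exploded.select("x.id", "x.team", "x.created_on", "x.links")
flattened.write.partitionBy("id").json(<path_to_folder>)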
