Question

I have the following JSON input data:

{
    "lib": [
      {
        "id": "a1",
        "type": "push",
        "icons": [
          {
            "iId": "111"
          }
        ],
        "id": "a2",
        "type": "pull",
        "icons": [
          {
            "iId": "111"
          },
          {
            "iId": "222"
          }
        ]
      }
]

I want to get the following Dataset:

id   type     iId
a1   push     111
a2   pull     111
a2   pull     222

How can I do this?

Here is my current code. I am using Spark 2.3 and Java 1.8:

ds = spark
         .read()
         .option("multiLine", true).option("mode", "PERMISSIVE")
         .json(jsonFilePath);

ds = ds
        .select(org.apache.spark.sql.functions.explode(ds.col("lib.icons")).as("icons"));

But the result is wrong:

+---------------+
|          icons|
+---------------+
|        [[111]]|
|[[111], [222...|
+---------------+

How can I get the correct Dataset?

Update:

I tried the following code, but it generates extra combinations of id, type and iId that do not exist in the input file.

ds = ds
      .withColumn("icons", org.apache.spark.sql.functions.explode(ds.col("lib.icons")))
      .withColumn("id", org.apache.spark.sql.functions.explode(ds.col("lib.id")))
      .withColumn("type", org.apache.spark.sql.functions.explode(ds.col("lib.type")));

ds = ds.withColumn("its",  org.apache.spark.sql.functions.explode(ds.col("icons")));

Answer 1

Your JSON appears to be malformed. Fixing the indentation makes this more apparent:

{
  "lib": [
    {
      "id": "a1",
      "type": "push",
      "icons": [
        {
          "iId": "111"
        }
      ],
      "id": "a2",
      "type": "pull",
      "icons": [
        {
          "iId": "111"
        },
        {
          "iId": "222"
        }
      ]
    }
  ]

Does your code work if you feed it this JSON instead?

{
  "lib": [
    {
      "id": "a1",
      "type": "push",
      "icons": [
        {
          "iId": "111"
        }
      ]
    },
    {
      "id": "a2",
      "type": "pull",
      "icons": [
        {
          "iId": "111"
        },
        {
          "iId": "222"
        }
      ]
    }
  ]
}

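Note the }, { inserted just before "id": "a2", which splits the object with duplicate keys into two separate objects, and the closing } at the very end, which was missing before.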
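A quick way to catch the missing closing brace before handing a file to Spark is to count braces outside string literals. The following is only a minimal sketch in plain Java: the class name BraceCheck and the method braceBalance are hypothetical names chosen for illustration, and the check does not detect duplicate keys or validate JSON in any other way.

```java
public class BraceCheck {

    // Returns (number of '{') minus (number of '}') found outside string literals.
    // A positive result means at least one closing brace is missing.
    static int braceBalance(String json) {
        int balance = 0;
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;               // skip the escaped character
                } else if (c == '"') {
                    inString = false;  // end of the string literal
                }
            } else if (c == '"') {
                inString = true;       // start of a string literal
            } else if (c == '{') {
                balance++;
            } else if (c == '}') {
                balance--;
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the two documents discussed above (assumption:
        // the real input would be read from the JSON file instead).
        String truncated = "{\"lib\":[{\"id\":\"a1\",\"icons\":[{\"iId\":\"111\"}]}]";
        String complete = truncated + "}";
        System.out.println("truncated balance: " + braceBalance(truncated));
        System.out.println("complete balance:  " + braceBalance(complete));
    }
}
```

A non-zero balance only says that the document cannot be well-formed; a balance of zero does not prove that it is.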

Answer 2

As already pointed out, the JSON string appears to be malformed. With the corrected JSON, you can get the desired result using the following code:

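import org.apache.spark.sql.functions._
import spark.implicits._  // enables the $"..." column syntax

spark.read
      .format("json")
      .option("multiLine", true)  // the file is a single JSON document spread over several lines
      .load("in/test.json")
      .select(explode($"lib").alias("result"))                                      // one row per element of lib
      .select($"result.id", $"result.type", explode($"result.icons").alias("iId"))  // one row per icon
      .select($"id", $"type", $"iId.iId")                                           // unwrap the icon struct
      .show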
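Since the question uses Java, the same pipeline can be sketched with the Java API. This is a rough equivalent rather than a tested drop-in: the class name FlattenLib, the method flatten and the use of args[0] as the file path are assumptions made for illustration, and the reader options are copied from the question. The key point is that lib is exploded first, so id, type and icons stay together in one struct. The attempt in the update explodes lib.icons, lib.id and lib.type independently, and each explode multiplies the rows, which is where the extra combinations come from.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlattenLib {

    // Turns the single row holding the whole "lib" array into one row per (id, type, iId).
    static Dataset<Row> flatten(Dataset<Row> ds) {
        return ds
            // one row per element of lib; id, type and icons stay together in "result"
            .select(explode(col("lib")).as("result"))
            // one row per icon, keeping the id and type of the enclosing element
            .select(col("result.id"), col("result.type"),
                    explode(col("result.icons")).as("icon"))
            // unwrap the icon struct into a plain iId column
            .select(col("id"), col("type"), col("icon.iId"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("flatten-lib")
            .master("local[*]")   // assumption: local run for experimentation
            .getOrCreate();

        Dataset<Row> ds = spark.read()
            .option("multiLine", true)
            .option("mode", "PERMISSIVE")
            .json(args[0]);       // path to the corrected JSON file

        flatten(ds).show();
        spark.stop();
    }
}
```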

How do I split JSON into Dataset rows?
