Using Spark SQL, how do I explode a JSON array?

Date: 2017-03-13 02:00:02

Tags: arrays json apache-spark-sql

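I am using a Databricks notebook with Spark SQL on a cluster of type "Community Optimized, Spark 2.1 (Auto-updating, Scala 2.11)" (i.e. the free Community account). The table response_raw has the following schema: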

root
 |-- TrackingLabel: string (nullable = true)
 |-- Json: string (nullable = true)

The TrackingLabel column contains a value like this:

000009FE-AA24-E511-8766-00505692010A

The Json column contains a value like this:

{
  "transaction": {
    ...
    "generatedInMillis": 24
  },
  "request": {
    "trackingLabel": "4f10225b-078d-11e7-8a18-020a01f809ab",
    ...
  },
  "response": {
    "locations": [
      {
        "trackingLabel": "0",
        ...
        "inputDelta": {
          "grade": {
            "letter": "B-",
            "number": 80,
            ...
          }
        }
      }
    ]
  }
}

I defined a new notebook cell:

drop table if exists response_json;
create table response_json as
SELECT rr.TrackingLabel
     , v1.transaction as TransactionJson
     , v1.request as RequestJson
     , v1.response as ResponseJson
  FROM response_raw AS rr
    LATERAL VIEW json_tuple(rr.json, 'transaction', 'request', 'response') v1 AS transaction, request, response

After executing the cell, the schema of response_json is:

root
 |-- TrackingLabel: string (nullable = true)
 |-- TransactionJson: string (nullable = true)
 |-- RequestJson: string (nullable = true)
 |-- ResponseJson: string (nullable = true)

I then created another notebook cell with this:

drop table if exists response_json_locations;
create table response_json_locations as
SELECT rj.TrackingLabel
     , v1.locations
  FROM response_json AS rj
    LATERAL VIEW json_tuple(rj.ResponseJson, 'locations') v1 AS locations

After executing the cell, the schema of response_json_locations is:

root
 |-- TrackingLabel: string (nullable = true)
 |-- locations: string (nullable = true)

I then defined a new notebook cell with this:

SELECT locationx.trackingLabel
  FROM response_json_locations AS rjl
  LATERAL VIEW explode(rjl.locations) l AS locationx

When I execute this cell, I get the following error:

Error in SQL statement: AnalysisException: cannot resolve 'explode(rjl.`locations`)' due to data type mismatch: input to function explode should be array or map type, not StringType; line 3 pos 2;
'Project ['locationx.trackingLabel]
+- 'Generate explode(locations#6127), true, false, l, ['locationx]
   +- SubqueryAlias rjl
      +- SubqueryAlias response_json_locations
         +- Relation[TrackingLabel#6126,locations#6127] parquet

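I have written and rewritten that last cell a dozen times trying to figure out how to access the JSON array. I need the inputDelta.grade.letter and inputDelta.grade.number values from each element of the locations JSON array. I must be missing something fairly simple about how to explode response.locations. Any guidance would be greatly appreciated.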
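To make the type mismatch concrete: json_tuple returns each extracted field as a string, so the locations column holds JSON text rather than an array, which matches the schema and the error message above. The following is a minimal plain-Python sketch of that situation, not Spark code. The helper names extract_field_as_text and explode_grades are hypothetical, and the sample value keeps only the fields shown in the question (the elided ones are left out). It only illustrates that the text has to be parsed into a real array before it can be exploded into one row per element.

```python
import json

# Hypothetical sample shaped like the ResponseJson value in the question;
# the fields elided there ("...") are simply left out here.
response_json = json.dumps({
    "locations": [
        {
            "trackingLabel": "0",
            "inputDelta": {"grade": {"letter": "B-", "number": 80}},
        }
    ]
})


def extract_field_as_text(json_text, key):
    """Mimic json_tuple: return a top-level field re-serialized as a string."""
    return json.dumps(json.loads(json_text)[key])


def explode_grades(locations_text):
    """Parse the JSON text into a list, then yield one row per element."""
    for location in json.loads(locations_text):
        grade = location["inputDelta"]["grade"]
        yield location["trackingLabel"], grade["letter"], grade["number"]


locations = extract_field_as_text(response_json, "locations")
print(type(locations).__name__)  # the extracted value is text, not a list

rows = list(explode_grades(locations))
print(rows)
```

In Spark terms, the parse step would have to happen before explode is applied, since explode only accepts array or map columns.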

0 answers