Exploding a JSON array in a Spark Dataset

Asked: 2017-04-11 13:50:44

Tags: json apache-spark dataset

I am using Spark 2.1 and Zeppelin 0.7 to do the following. (This is inspired by the Databricks tutorial at https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html.)

I created the following schema:

import org.apache.spark.sql.types._

// A top-level "Records" array whose elements are event structs.
val jsonSchema = new StructType()
  .add("Records", ArrayType(new StructType()
    .add("Id", IntegerType)
    .add("eventDt", StringType)
    .add("appId", StringType)
    .add("userId", StringType)
    .add("eventName", StringType)
    .add("eventValues", StringType)
  ))
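As a quick sanity check, a minimal sketch that prints the schema tree before handing it to the reader (StructType exposes a printTreeString() helper):

jsonSchema.printTreeString()
// root
//  |-- Records: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- Id: integer (nullable = true)
//  |    |    ...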

Next, I read in the following JSON 'array' file, which I have in my 'inputPath' directory:

{
"Records": [{
    "Id": 9550,
    "eventDt": "1491810477700",
    "appId": "dandb01",
    "userId": "985580",
    "eventName": "OG: HR: SELECT",
    "eventValues": "985087"
    },
    ... other records
]}

val rawRecords = spark.read.schema(jsonSchema).json(inputPath)

Then I want to explode these records into individual events:

val events = rawRecords.select(explode($"Records").as("record"))

However, both rawRecords.show() and events.show() return null.

Any idea what I am doing wrong? In the past I know I should have used JSONL, but the Databricks tutorial suggests that recent versions of Spark should now support JSON arrays.
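For reference, if the input really is pretty-printed JSON spanning multiple lines (as in the sample above), a minimal sketch assuming Spark 2.2+ is to enable the reader's multiLine option; Spark 2.1's JSON reader still expects one complete document per line:

val rawRecords = spark.read
  .schema(jsonSchema)
  .option("multiLine", true)  // assumes Spark 2.2+, where this option was introduced
  .json(inputPath)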

1 Answer:

Answer 0 (score: 1):

I did the following:

  1. I have a file foo.txt with the following data (note that each JSON document sits on a single line):

    {"Records":[{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"},{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"},{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"},{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"}]}
    {"Records":[{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"},{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"},{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"},{"Id":9550,"eventDt":"1491810477700","appId":"dandb01","userId":"985580","eventName":"OG: HR: SELECT","eventValues":"985087"}]}

  2. I ran the following code:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions._

    val df = sqlContext.read.json("foo.txt")
    df.printSchema()
    df.select(explode($"Records").as("record")).show

  3. I got the following output:

      root
       |-- Records: array (nullable = true)
       |    |-- element: struct (containsNull = true)
       |    |    |-- Id: long (nullable = true)
       |    |    |-- appId: string (nullable = true)
       |    |    |-- eventDt: string (nullable = true)
       |    |    |-- eventName: string (nullable = true)
       |    |    |-- eventValues: string (nullable = true)
       |    |    |-- userId: string (nullable = true)

      +--------------------+
      |              record|
      +--------------------+
      |[9550,dandb01,149...|
      |[9550,dandb01,149...|
      |[9550,dandb01,149...|
      |[9550,dandb01,149...|
      |[9550,dandb01,149...|
      |[9550,dandb01,149...|
      |[9550,dandb01,149...|
      |[9550,dandb01,149...|
      +--------------------+
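As a follow-up, a minimal sketch of flattening the exploded struct into top-level columns, using the field names shown by printSchema() above:

    // Promote every field of the "record" struct to a top-level column.
    val events = df
      .select(explode($"Records").as("record"))
      .select("record.*")
    events.show()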