How to explode an array (from JSON) in a DataFrame?

Time: 2017-04-23 11:13:35

Tags: scala apache-spark dataframe apache-spark-sql

Each record in my RDD contains a JSON document. I am using SQLContext to create a DataFrame from the JSON like this:

val signalsJsonRdd = sqlContext.jsonRDD(signalsJson)

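The schema is shown below. datapayload is an array of items. I want to explode the array so that I get a DataFrame in which each row is one item from datapayload. I tried something based on this answer, but it seems I would have to model the entire structure of an item in the case Row(arr: Array[...]) pattern. I am probably missing something.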

val payloadDfs = signalsJsonRdd.explode($"data.datapayload"){ 
    case org.apache.spark.sql.Row(arr: Array[String]) =>  arr.map(Tuple1(_)) 
}

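The code above throws a scala.MatchError, because the type of the actual Row is very different from Row(arr: Array[String]). There is probably a simple way to do what I want, but I cannot find it. Please help.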

The schema is below:

signalsJsonRdd.printSchema()

root
 |-- _corrupt_record: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- dataid: string (nullable = true)
 |    |-- datapayload: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Reading: struct (nullable = true)
 |    |    |    |    |-- A2DPActive: boolean (nullable = true)
 |    |    |    |    |-- Accuracy: double (nullable = true)
 |    |    |    |    |-- Active: boolean (nullable = true)
 |    |    |    |    |-- Address: string (nullable = true)
 |    |    |    |    |-- Charging: boolean (nullable = true)
 |    |    |    |    |-- Connected: boolean (nullable = true)
 |    |    |    |    |-- DeviceName: string (nullable = true)
 |    |    |    |    |-- Guid: string (nullable = true)
 |    |    |    |    |-- HandsFree: boolean (nullable = true)
 |    |    |    |    |-- Header: double (nullable = true)
 |    |    |    |    |-- Heading: double (nullable = true)
 |    |    |    |    |-- Latitude: double (nullable = true)
 |    |    |    |    |-- Longitude: double (nullable = true)
 |    |    |    |    |-- PositionSource: long (nullable = true)
 |    |    |    |    |-- Present: boolean (nullable = true)
 |    |    |    |    |-- Radius: double (nullable = true)
 |    |    |    |    |-- SSID: string (nullable = true)
 |    |    |    |    |-- SSIDLength: long (nullable = true)
 |    |    |    |    |-- SpeedInKmh: double (nullable = true)
 |    |    |    |    |-- State: string (nullable = true)
 |    |    |    |    |-- Time: string (nullable = true)
 |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |-- Time: string (nullable = true)
 |    |    |    |-- Type: string (nullable = true)

1 Answer:

Answer 0 (score: 3)

tl;dr The explode function is your friend (or my favorite, flatMap).

The explode function creates a new row for each element in the given array or map column.
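To make that behavior concrete, here is a minimal sketch on toy data. It assumes Spark 2.x or later with a local SparkSession; the object name ExplodeSemantics, the helper explodeXs, and the columns id and xs are hypothetical and not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeSemantics {
  // Hypothetical helper: one output row per element of the array column "xs";
  // the value of "id" is repeated on every generated row.
  def explodeXs(df: DataFrame): DataFrame =
    df.select(col("id"), explode(col("xs")).as("x"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-semantics")
      .getOrCreate()
    import spark.implicits._

    // Toy input: "a" carries three elements, "b" carries an empty array.
    val df = Seq(("a", Seq(1, 2, 3)), ("b", Seq.empty[Int])).toDF("id", "xs")

    // Yields three rows, all with id "a". The row for "b" is dropped,
    // because explode emits nothing for an empty array.
    explodeXs(df).show()

    spark.stop()
  }
}
```

Note that rows whose array is empty or null vanish from the result, which matters if you later count records per dataid.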

The following should work:

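// explode lives in the functions object; $"..." needs sqlContext.implicits._ in scope
import org.apache.spark.sql.functions.explode
signalsJsonRdd.withColumn("element", explode($"data.datapayload"))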

See the functions object (org.apache.spark.sql.functions).
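Once the array is exploded, each element is a struct whose fields can be reached with dot paths. The sketch below illustrates this on a single made-up record shaped like the schema in the question. It assumes Spark 2.2 or later (for spark.read.json on a Dataset[String]); the object name, the helper flattenPayload, and all sample values are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodePayloadSketch {
  // Hypothetical helper: explode data.datapayload, then pull a few nested
  // fields of each element up into top-level columns.
  def flattenPayload(signals: DataFrame): DataFrame =
    signals
      .withColumn("element", explode(col("data.datapayload")))
      .select(
        col("data.dataid"),
        col("element.Type"),
        col("element.Time"),
        col("element.Reading.Latitude"),
        col("element.Reading.SSID"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-payload-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up record with only a handful of the fields from the question's schema.
    val json = Seq(
      """{"data":{"dataid":"d1","datapayload":[""" +
        """{"Type":"gps","Time":"t1","Reading":{"Latitude":1.0,"Longitude":2.0}},""" +
        """{"Type":"wifi","Time":"t2","Reading":{"SSID":"home"}}]}}""")

    val signals = spark.read.json(json.toDS())

    // One input record with two payload items becomes two rows.
    flattenPayload(signals).show(false)

    spark.stop()
  }
}
```

Fields that a given item does not carry (SSID on the gps item, Latitude on the wifi item) come back as null, since the inferred struct is the union of the fields seen across all items.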