How to un-cube, un-rollup, or flatten a hierarchical/nested DataFrame built from JSON?

Asked: 2016-08-26 14:49:32

Tags: apache-spark spark-dataframe

I have a Spark DataFrame built from JSON, with the following schema:

root
|-- Engagement ID: string (nullable = true)
|-- Transcript: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- created_at: string (nullable = true)
|    |    |-- message: string (nullable = true)
|    |    |-- sender: struct (nullable = true)
|    |    |    |-- href: string (nullable = true)
|    |    |    |-- name: string (nullable = true)
|    |    |    |-- type: string (nullable = true)
|-- _corrupt_record: string (nullable = true)

What I want to do is flatten it, at least far enough that each entry in a Transcript ends up on its own row, basically like this:

root
|-- Engagement ID: string (nullable = true)
|-- created_at: string (nullable = true)
|-- message: string (nullable = true)
|-- sender: struct (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)

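Of course, this will produce many rows with the same Engagement ID, and the original structure resembles what you would get from a rollup or cube. Is there a simple way to un-rollup, un-cube, flatten, or whatever the right term is? And can it be done without dropping down to the underlying RDD?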
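
To make the intended reshaping concrete, here is a minimal sketch rather than a confirmed solution. `flatten_records` is a plain-Python stand-in that emits one row per Transcript element, and `flatten_with_spark` sketches what appears to be the DataFrame-only equivalent using `explode` from `pyspark.sql.functions`, which avoids the RDD API. Both function names, the sample record, and the choice to lift `sender.name` and `sender.type` to top-level columns are assumptions made for illustration.

```python
def flatten_records(records):
    """Plain-Python stand-in: one output row per element of "Transcript"."""
    rows = []
    for record in records:
        # Like Spark's explode, a missing or empty array contributes no rows.
        for entry in record.get("Transcript") or []:
            sender = entry.get("sender") or {}
            rows.append({
                "Engagement ID": record.get("Engagement ID"),
                "created_at": entry.get("created_at"),
                "message": entry.get("message"),
                "name": sender.get("name"),
                "type": sender.get("type"),
            })
    return rows


def flatten_with_spark(df):
    """Sketch of the same reshaping with the DataFrame API only (needs pyspark)."""
    # Imported lazily so the plain-Python part above runs without Spark installed.
    from pyspark.sql.functions import col, explode

    # explode turns each array element into its own row; the struct is aliased as "t".
    exploded = df.select(col("Engagement ID"), explode(col("Transcript")).alias("t"))
    # Nested struct fields are reachable with dot notation.
    return exploded.select(
        col("Engagement ID"),
        col("t.created_at").alias("created_at"),
        col("t.message").alias("message"),
        col("t.sender.name").alias("name"),
        col("t.sender.type").alias("type"),
    )


# Hypothetical sample data, made up for illustration only.
sample = [
    {
        "Engagement ID": "E1",
        "Transcript": [
            {"created_at": "2016-08-01T10:00:00", "message": "hi",
             "sender": {"href": "/u/1", "name": "Ann", "type": "agent"}},
            {"created_at": "2016-08-01T10:00:05", "message": "hello",
             "sender": {"href": "/u/2", "name": "Bob", "type": "visitor"}},
        ],
    },
    {"Engagement ID": "E2", "Transcript": None},
]

flat = flatten_records(sample)
for row in flat:
    print(row)
```

Note that `explode` drops input rows whose array is null or empty, which the plain-Python version mirrors. If those engagements need to be kept, that would have to be handled separately.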

0 answers:

No answers yet.