I have a Spark DataFrame built from JSON, with the following schema:
root
|-- Engagement ID: string (nullable = true)
|-- Transcript: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- created_at: string (nullable = true)
| | |-- message: string (nullable = true)
| | |-- sender: struct (nullable = true)
| | | |-- href: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- type: string (nullable = true)
|-- _corrupt_record: string (nullable = true)
What I'd like to do is flatten it, at least so that the information from each transcript entry ends up on its own row, basically like this:
root
|-- Engagement ID: string (nullable = true)
|-- created_at: string (nullable = true)
|-- message: string (nullable = true)
|-- sender: struct (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
This will of course result in many rows sharing the same Engagement ID, with the original fields repeated much as they would be after a rollup or cube. Is there a simple way to unnest, explode, flatten, or whatever the right term is? And can it be done without dropping down to the underlying RDD?
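
For reference, this is roughly what I have in mind (a PySpark sketch; the column names come from the schema above, while df and the read path are just placeholders). I'm not sure whether explode followed by selecting the nested fields is the idiomatic way to do this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# df stands in for the DataFrame read from the JSON source, e.g.:
# df = spark.read.json("engagements.json")

# explode() turns each element of the Transcript array into its own row,
# then the struct fields are pulled up into top-level columns
flat = (
    df.select(col("Engagement ID"), explode(col("Transcript")).alias("t"))
      .select(
          col("Engagement ID"),
          col("t.created_at").alias("created_at"),
          col("t.message").alias("message"),
          col("t.sender.name").alias("name"),
          col("t.sender.type").alias("type"),
      )
)
flat.printSchema()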