Question

I have a table in Hive with the following schema:

root
 |-- startdate: string (nullable = true)
 |-- enddate: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- .......: string (nullable = true)
 |    |    |-- otherfields: string (nullable = true)

I only want to get the _id and name fields from the items array column, i.e.:

|-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _id: string (nullable = true)
 |    |    |-- name: string (nullable = true)

Is there a way to retrieve only the columns I actually need from Hive, rather than doing additional transformations in Spark itself?

I am using Spark 2.2.

Answer 1

You can try the following:

data.select("items._id", "items.name")

However, this will likely give you two separate arrays instead of a single array of structs:

root
 |-- _id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- name: array (nullable = true)
 |    |-- element: string (containsNull = true)

In Spark 2.4+, you can try using arrays_zip to combine the two arrays back into a single array of structs.
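
A minimal sketch of the arrays_zip idea in PySpark, not verified against a real cluster. The helper names `subset_struct_array_expr` and `select_item_subset` are hypothetical, `data` is assumed to be the DataFrame from the snippet above, and both fields are assumed to be strings as in the question's schema. The cast is there because the struct field names produced by arrays_zip may differ between Spark versions; casting to an explicit type should restore `_id` and `name`.

```python
def subset_struct_array_expr(array_col, fields):
    """Build a Spark SQL expression that keeps only `fields` from each
    struct in `array_col`. Relies on arrays_zip, so Spark 2.4+ is assumed."""
    # items._id, items.name -> one array per selected field
    per_field_arrays = ", ".join(f"{array_col}.{f}" for f in fields)
    # Assumption: every selected field is a string, as in the question's schema.
    struct_type = ",".join(f"{f}:string" for f in fields)
    # Zip the per-field arrays into one array of structs, then cast so the
    # struct fields carry the original names.
    return (
        f"CAST(arrays_zip({per_field_arrays}) "
        f"AS array<struct<{struct_type}>>) AS {array_col}"
    )


def select_item_subset(data):
    """`data` is assumed to be a DataFrame with the schema from the question."""
    return data.selectExpr(
        "startdate",
        "enddate",
        subset_struct_array_expr("items", ["_id", "name"]),
    )
```

Calling `select_item_subset(data).printSchema()` should then show whether `items` has the intended shape. Note that this still runs as a transformation in Spark; whether the unused struct fields are skipped when reading from Hive depends on the storage format and Spark version.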

How can Spark SQL query a subset of struct fields in an Array[Struct]?
