Question

I have a DataFrame like this:

val df = Seq(
  Seq(("a","b","c"))
  )
.toDF("arr")
.select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))

df.printSchema

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: string (nullable = true)
 |    |    |-- c2: string (nullable = true)
 |    |    |-- c3: string (nullable = true)

df.show()
+---------+
|      arr|
+---------+
|[[a,b,c]]|
+---------+

I want to select only c1 and c3, so that:

df.printSchema

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: string (nullable = true)
 |    |    |-- c3: string (nullable = true)

df.show()

+---------+
|      arr|
+---------+
|[[a,c]]  |
+---------+

Can this be done without a UDF?

I can do it with a UDF, but I would like a solution without one. For example:

df
.select($"arr.c1".as("arr"))

root
 |-- arr: array (nullable = true)
 |    |-- element: string (containsNull = true)

But this only works for selecting a single struct field. I also tried:

df
.select(array(struct($"arr.c1",$"arr.c3")).as("arr"))

but this gives

root
 |-- arr: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- c1: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- c3: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)

Answer 1

I can only offer an answer using the Python API, but I am fairly sure the Scala API works very similarly.

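The key is the function arrays_zip which, according to the documentation, "[r]eturns a merged array of structs in which the N-th struct contains all N-th values of input arrays."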

Example (still from the documentation):

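>>> from pyspark.sql.functions import arrays_zip
>>> df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
>>> df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
[Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]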
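
Applied to the DataFrame from the question, the same idea might look like the following in Scala. This is only a sketch, assuming Spark 2.4 or later (where arrays_zip is available in org.apache.spark.sql.functions) and a local SparkSession. The object name ArraysZipSketch and the trailing cast are my own additions: depending on the Spark version, the zipped struct fields may come out named "0" and "1" rather than c1 and c3, so the cast is there to pin the field names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.arrays_zip

// Hypothetical wrapper object, used only to make the sketch self-contained.
object ArraysZipSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("arrays-zip-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same input as in the question: an array of structs with c1, c2, c3.
    val df = Seq(Seq(("a", "b", "c")))
      .toDF("arr")
      .select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))

    // $"arr.c1" and $"arr.c3" are each array<string>; arrays_zip merges them
    // element by element back into a single array of structs.
    val result = df.select(
      arrays_zip($"arr.c1", $"arr.c3")
        // Assumption: the cast renames the struct fields by position, in case
        // this Spark version names them "0" and "1".
        .cast("array<struct<c1:string,c3:string>>")
        .as("arr")
    )

    result.printSchema()
    result.show()

    spark.stop()
  }
}
```

The point of zipping is that it keeps the outer array at one struct per original element, whereas wrapping the two projected columns in array(struct(...)) produces a single struct whose fields are themselves arrays, as the schema in the question shows.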
