How do I filter the fields of an array in a DataFrame?

Date: 2018-07-06 14:25:26

Tags: scala apache-spark dataframe apache-spark-sql

I'm processing a JSON file with DataFrames, but I can't figure out how to filter the fields of an array.

Here is my input schema:

root
 |-- MyObject: struct (nullable = true)
 |    |-- Field1: long (nullable = true)
 |    |-- Field2: string (nullable = true)
 |    |-- Field3: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- Field3_1: boolean (nullable = true)
 |    |    |    |-- Field3_2: string (nullable = true)
 |    |    |    |-- Field3_3: string (nullable = true)

I want a DataFrame like this:

root
 |-- Field1: long (nullable = true)
 |-- Field3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Field3_1: boolean (nullable = true)
 |    |    |-- Field3_3: string (nullable = true)

The best I've come up with is:

df.select($"MyObject.Field1",
          $"MyObject.Field3.Field3_1" as "Field3.Field3_1",
          $"MyObject.Field3.Field3_3" as "Field3.Field3_3")

which gives me:

root
 |-- Field1: long (nullable = true)
 |-- Field3_1: array (nullable = true)
 |    |-- element: boolean (nullable = true)
 |-- Field3_3: array (nullable = true)
 |    |-- element: string (nullable = true)

I can't use the array function because Field3_1 and Field3_3 have different types.
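For example, a hypothetical attempt like the one below fails at analysis time, because array requires all of its arguments to share a single type:

df.select(array($"MyObject.Field3.Field3_1",   // array<boolean>
                $"MyObject.Field3.Field3_3"))  // array<string> -> data type mismatch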

How can I create an array containing only the selected fields?

I'm a beginner with Spark SQL, so maybe I'm missing something obvious!
Thanks.

1 Answer:

Answer 0 (score: 1)

The simplest solution is to use a udf function, like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// map each Row in the array to a typed case-class instance,
// keeping only the two wanted fields
val arraystructUdf = udf((f3: Seq[Row]) =>
  f3.map(row => field3(row.getAs[Boolean]("Field3_1"), row.getAs[String]("Field3_3"))))

df.select(col("MyObject.Field1"), arraystructUdf(col("MyObject.Field3")).as("Field3"))

where field3 is a case class:

case class field3(Field3_1:Boolean, Field3_3:String)
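Returning a Seq of case-class instances lets Spark derive the schema of the UDF's result through reflection, which is why the new Field3 column comes back as an array of structs containing exactly the two selected fields.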

This should give you:

root
 |-- Field1: long (nullable = true)
 |-- Field3: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Field3_1: boolean (nullable = false)
 |    |    |-- Field3_3: string (nullable = true)
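As a side note: on Spark 2.4+ (released after this question was asked), the same reshaping can be done without a UDF using the transform higher-order function. A minimal sketch, assuming the same df:

// rebuild each array element as a struct holding only the two selected fields
df.select(
  col("MyObject.Field1"),
  expr("transform(MyObject.Field3, x -> struct(x.Field3_1 as Field3_1, x.Field3_3 as Field3_3))").as("Field3"))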

I hope this answer helps.