Filtering a DataFrame by a nested array

Asked: 2019-06-26 18:50:55

Tags: scala apache-spark apache-spark-sql

Given the following structure:

root
     |-- lvl_1: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- l1_Attribute: String
     ....
     |    |    |-- lvl_2: array
     |    |    |    |-- element: struct
     |    |    |    |    |-- l2_attribute: String
     ...

I want to filter on l2_attribute inside the lvl_2 array, which is itself nested inside the lvl_1 array. Can this be done without first exploding the lvl_1 array?

I can filter the lvl_1 array without exploding it:

rdd.select(<columns>,
      expr("filter(lvl_1, lvl_1_struct -> upper(lvl_1_struct.l1_Attribute) == 'FOO')"))

But I don't know how to do the same for the nested lvl_2 array.
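Roughly, I am after something of this shape, with one higher-order function nested inside another (a sketch of the intent only, not code I have working; exists requires Spark 2.4+, and 'FOO' is a placeholder literal):

rdd.select(<columns>,
      expr("filter(lvl_1, l1 -> exists(l1.lvl_2, l2 -> upper(l2.l2_attribute) == 'FOO'))"))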

1 answer:

Answer 0 (score: 3):

See if this helps. The idea is to flatten the inner arrays and then filter with the org.apache.spark.sql.functions.array_contains function.

If you are on Spark 2.4+, you can use the built-in org.apache.spark.sql.functions.flatten function instead of the UDF shown in this solution (which targets Spark 2.3).
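A minimal sketch of that Spark 2.4+ variant (not part of the original REPL session; it assumes the same df built below):

import org.apache.spark.sql.functions.{array_contains, col, flatten}

// lvl_1._2 projects the inner arrays out of the structs as array<array<int>>;
// flatten then collapses them into a single array<int> per row.
df.withColumn("flattnedinnerarrays", flatten(col("lvl_1._2")))
  .filter(array_contains(col("flattnedinnerarrays"), 10))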

val df = Seq(
  Seq(
    ("a", Seq(2, 4, 6, 8, 10, 12)),
    ("b", Seq(3, 6, 9, 12)),
    ("c", Seq(1, 2, 3, 4))
  ),
  Seq(
    ("e", Seq(4, 8, 12)),
    ("f", Seq(1, 3, 6)),
    ("g", Seq(3, 4, 5, 6))
  )
).toDF("lvl_1")

df: org.apache.spark.sql.DataFrame = [lvl_1: array<struct<_1:string,_2:array<int>>>]
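(Note: toDF on a local Seq relies on import spark.implicits._, which spark-shell brings in automatically; a standalone application must import it from its own SparkSession instance.)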

scala> df.show(false)
+------------------------------------------------------------------+
|lvl_1                                                             |
+------------------------------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|
|[[e, [4, 8, 12]], [f, [1, 3, 6]], [g, [3, 4, 5, 6]]]              |
+------------------------------------------------------------------+


scala> def flattenSeqOfSeq[S](x:Seq[Seq[S]]): Seq[S] = { x.flatten }
flattenSeqOfSeq: [S](x: Seq[Seq[S]])Seq[S]

scala> val myUdf = udf { flattenSeqOfSeq[Int] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(IntegerType,false),Some(List(ArrayType(ArrayType(IntegerType,false),true))))

scala> df.withColumn("flattnedinnerarrays", myUdf($"lvl_1".apply("_2")))
res66: org.apache.spark.sql.DataFrame = [lvl_1: array<struct<_1:string,_2:array<int>>>, flattnedinnerarrays: array<int>]

scala> res66.show(false)
+------------------------------------------------------------------+---------------------------------------------+
|lvl_1                                                             |flattnedinnerarrays                          |
+------------------------------------------------------------------+---------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|[2, 4, 6, 8, 10, 12, 3, 6, 9, 12, 1, 2, 3, 4]|
|[[e, [4, 8, 12]], [f, [1, 3, 6]], [g, [3, 4, 5, 6]]]              |[4, 8, 12, 1, 3, 6, 3, 4, 5, 6]              |
+------------------------------------------------------------------+---------------------------------------------+

scala> res66.filter(array_contains($"flattnedinnerarrays", 10)).show(false)
+------------------------------------------------------------------+---------------------------------------------+
|lvl_1                                                             |flattnedinnerarrays                          |
+------------------------------------------------------------------+---------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|[2, 4, 6, 8, 10, 12, 3, 6, 9, 12, 1, 2, 3, 4]|
+------------------------------------------------------------------+---------------------------------------------+

scala> res66.filter(array_contains($"flattnedinnerarrays", 3)).show(false)
+------------------------------------------------------------------+---------------------------------------------+
|lvl_1                                                             |flattnedinnerarrays                          |
+------------------------------------------------------------------+---------------------------------------------+
|[[a, [2, 4, 6, 8, 10, 12]], [b, [3, 6, 9, 12]], [c, [1, 2, 3, 4]]]|[2, 4, 6, 8, 10, 12, 3, 6, 9, 12, 1, 2, 3, 4]|
|[[e, [4, 8, 12]], [f, [1, 3, 6]], [g, [3, 4, 5, 6]]]              |[4, 8, 12, 1, 3, 6, 3, 4, 5, 6]              |
+------------------------------------------------------------------+---------------------------------------------+
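As a small follow-up not shown in the original session: the helper column exists only to drive the filter, so it can be dropped once the rows are selected:

res66.filter(array_contains($"flattnedinnerarrays", 3)).drop("flattnedinnerarrays")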