Spark: aggregate a column into a Set efficiently

Date: 2017-03-19 15:56:18

Tags: scala apache-spark apache-spark-sql spark-dataframe apache-spark-dataset

How do I efficiently aggregate a column into a Set (an array of unique elements) in Spark?

case class Foo(a:String, b:String, c:Int, d:Array[String])

  val df = Seq(Foo("A", "A", 123, Array("A")),
    Foo("A", "A", 123, Array("B")),
    Foo("B", "B", 123, Array("C", "A")),
    Foo("B", "B", 123, Array("C", "E", "A")),
    Foo("B", "B", 123, Array("D"))
  ).toDS()

will result in:

+---+---+---+---------+
|  a|  b|  c|        d|
+---+---+---+---------+
|  A|  A|123|      [A]|
|  A|  A|123|      [B]|
|  B|  B|123|   [C, A]|
|  B|  B|123|[C, E, A]|
|  B|  B|123|      [D]|
+---+---+---+---------+

What I am looking for is (the ordering of column d does not matter):

+---+---+---+------------+
|  a|  b|  c|           d|
+---+---+---+------------+
|  A|  A|123|      [A, B]|
|  B|  B|123|[C, A, E, D]|
+---+---+---+------------+

This is somewhat similar to How to aggregate values into collection after groupBy? and to the example from High Performance Spark: https://github.com/high-performance-spark/high-performance-spark-examples/blob/57a6267fb77fae5a90109bfd034ae9c18d2edf22/src/main/scala/com/high-performance-spark-examples/transformations/SmartAggregations.scala#L33-L43

Using the following code:

import org.apache.spark.sql.functions.{collect_list, udf}

val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten.distinct)
val d = flatten(collect_list($"d")).alias("d")
df.groupBy($"a", $"b", $"c").agg(d).show

will produce the desired result, but I wonder whether it is possible to improve performance by using the RDD API outlined in the book, and how this could be formulated with the Dataset API.

Details of the execution of this minimal sample are shown below:

== Optimized Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
   +- Aggregate [a#45, b#46, c#47], [a#45, b#46, c#47, UDF(collect_list(d#48, 0, 0)) AS d#82]
      +- LocalRelation [a#45, b#46, c#47, d#48]

== Physical Plan ==
CollectLimit 21
+- SortAggregate(key=[a#45, b#46, c#47], functions=[collect_list(d#48, 0, 0)], output=[a#45, b#46, c#47, d#82])
   +- *Sort [a#45 ASC NULLS FIRST, b#46 ASC NULLS FIRST, c#47 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(a#45, b#46, c#47, 200)
         +- LocalTableScan [a#45, b#46, c#47, d#48]

(images: SQL query DAG, stage DAG)

EDIT

The problems with this operation are outlined very well here: https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
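For reference, here is a minimal sketch of what an RDD-based variant could look like, using aggregateByKey so that a set is built per key and partition before the shuffle. This is my own assumption about the kind of approach the book hints at, not a benchmarked implementation, and rddResult is just an illustrative name:

import scala.collection.mutable

// Build one mutable Set per key and partition, then merge the sets across
// partitions; only the partition-local, already deduplicated sets are shuffled.
val rddResult = df.rdd
  .map(foo => ((foo.a, foo.b, foo.c), foo.d))
  .aggregateByKey(mutable.Set.empty[String])(
    (acc, arr) => acc ++= arr, // fold each row's array into the partition-local set
    (s1, s2) => s1 ++= s2      // merge partition-local sets after the shuffle
  )
  .map { case ((a, b, c), d) => Foo(a, b, c, d.toArray) }

Whether this actually beats collect_list plus the UDF would have to be measured on real data.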

EDIT2

As you can see, the DAG of the Dataset query suggested below is more complicated and, instead of 0.4 s, it seems to take 2 s. (image: DAG for answer 1)

1 answer:

Answer 0 (score: 1):

Try this:

df.groupByKey(foo => (foo.a, foo.b, foo.c))
  .reduceGroups { (foo1, foo2) =>
    foo1.copy(d = (foo1.d ++ foo2.d).distinct)
  }
  .map(_._2)
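As a further alternative, not part of the original answer, one could stay in the DataFrame API and let collect_set handle the deduplication after exploding the arrays. A sketch, assuming the extra rows produced by explode are acceptable:

import org.apache.spark.sql.functions.{collect_set, explode}

// Explode each array into one row per element, then collect the distinct
// elements per group; collect_set drops duplicates on its own.
val viaCollectSet = df
  .withColumn("d", explode($"d"))
  .groupBy($"a", $"b", $"c")
  .agg(collect_set($"d").alias("d"))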