Aggregating over multiple columns in Spark / Scala

Date: 2017-09-21 11:31:28

Tags: scala apache-spark

I have a Spark DataFrame with many columns:

val df = Seq(
  ("a", 2, 3, 5, 3, 4, 2, 6, 7, 3),
  ("a", 1, 1, 2, 4, 5, 7, 3, 5, 2),
  ("b", 5, 7, 3, 6, 8, 8, 9, 4, 2),
  ("b", 2, 2, 3, 5, 6, 3, 2, 4, 8),
  ("b", 2, 5, 5, 4, 3, 6, 7, 8, 8),
  ("c", 1, 2, 3, 4, 5, 6, 7, 8, 9)
).toDF("id", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9")

Now I would like to groupBy over id and get the sum of each p column for each id.

At the moment I am doing the following:

val dfg =
  df.groupBy("id")
    .agg(
      sum($"p1").alias("p1"),
      sum($"p2").alias("p2"),
      sum($"p3").alias("p3"),
      sum($"p4").alias("p4"),
      sum($"p5").alias("p5"),
      sum($"p6").alias("p6"),
      sum($"p7").alias("p7"),
      sum($"p8").alias("p8"),
      sum($"p9").alias("p9")
    )

which produces the (correct) output:

+---+---+---+---+---+---+---+---+---+---+
| id| p1| p2| p3| p4| p5| p6| p7| p8| p9|
+---+---+---+---+---+---+---+---+---+---+
|  c|  1|  2|  3|  4|  5|  6|  7|  8|  9|
|  b|  9| 14| 11| 15| 17| 17| 18| 16| 18|
|  a|  3|  4|  7|  7|  9|  9|  9| 12|  5|
+---+---+---+---+---+---+---+---+---+---+

The problem is that in reality I have a few dozen p-columns, and I would like to be able to write the aggregation more concisely.

Based on the answers to this question, I have tried the following:

val pcols = List.range(1, 10)
val ops = pcols.map(k => sum(df(s"p$k")).alias(s"p$k"))
val dfg =
  df.groupBy("id")
    .agg(ops: _*)  // does not compile — agg does not accept *-parameters

Unfortunately, unlike select(), agg() does not seem to accept *-parameters, so this does not work and fails at compile time with a no ': _*' annotation allowed here error.
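(As an aside, the expression list itself does not have to be built from a hard-coded range either; a minimal sketch, assuming every value column, and only those, starts with the prefix p:

import org.apache.spark.sql.functions.{col, sum}

// Derive one sum expression per value column from the schema itself,
// instead of hard-coding List.range(1, 10).
val ops = df.columns
  .filter(_.startsWith("p"))
  .map(c => sum(col(c)).alias(c))
  .toList

This still leaves the question of how to pass the list to agg().)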

1 Answer:

Answer 0 (score: 1)

agg has this signature: def agg(expr: Column, exprs: Column*): DataFrame

So try this:

df.groupBy("id")
    .agg(ops.head,ops.tail:_*)
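
Putting it together, a minimal sketch of the full call, using the ops list from the question (the resulting columns keep the p1 ... p9 aliases set on the expressions):

// ops.head fills the required `expr` parameter; ops.tail is expanded into the
// `exprs` vararg, which is where the `: _*` annotation is allowed.
val dfg =
  df.groupBy("id")
    .agg(ops.head, ops.tail: _*)

dfg.show()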