Does Spark Scala support GROUPING SETS at the function level?
I don't know whether this patch made it into master: https://github.com/apache/spark/pull/5080
I would like to run this kind of query through the Scala DataFrame API:
GROUP BY expression list GROUPING SETS(expression list2)
The Dataset API provides cube and rollup functions, but I cannot find grouping sets. Why is that?
Answer 0 (score: 2)
I would like to run this kind of query through the Scala DataFrame API.

tl;dr As of Spark 2.1.0, this is not possible, and there are currently no plans to add such an operator to the Dataset API.

Spark SQL supports the following so-called multi-dimensional aggregate operators:

- rollup operator
- cube operator
- GROUPING SETS clause (SQL mode only)
- grouping() and grouping_id() functions

Note: GROUPING SETS is available only in SQL mode; the Dataset API does not support it.
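The grouping() and grouping_id() functions mentioned above can be used from the Dataset API to tell which rows of a cube or rollup result are subtotals. A minimal self-contained sketch (the local SparkSession setup and the tiny demo DataFrame are illustrative assumptions, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping, grouping_id, sum}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouping-demo")
  .getOrCreate()
import spark.implicits._

val demo = Seq(("Warsaw", 2016, 100), ("Boston", 2015, 50))
  .toDF("city", "year", "amount")

// grouping_id() encodes which grouping columns were aggregated away:
// for cube("city", "year"): 0 = (city, year), 1 = (city), 2 = (year), 3 = ()
val q = demo.cube("city", "year")
  .agg(sum("amount") as "amount", grouping("city") as "g_city", grouping_id() as "gid")
q.show()
```

This makes it easy to filter subtotal rows (e.g. gid = 3 is the grand total) instead of testing columns for null.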
// spark-shell session: spark.implicits._ is in scope, so toDF is available on Seq
val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")
// equivalent to rollup("city", "year")
val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|null| 550| <-- grand total across all cities and years
+-------+----+------+
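Since this grouping-sets query is equivalent to rollup("city", "year"), the same result can be produced without SQL through the Dataset API. A self-contained sketch (the local SparkSession setup is an illustrative assumption; in spark-shell the session and the sales DataFrame above already exist):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("rollup-demo").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, 100), ("Warsaw", 2017, 200),
  ("Boston", 2015, 50), ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// equivalent to GROUP BY city, year GROUPING SETS ((city, year), (city), ())
val byRollup = sales.rollup("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy($"city".desc_nulls_last, $"year".asc_nulls_last)
byRollup.show()
```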
// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), (year), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|2015| 50| <-- total across all cities in 2015
| null|2016| 250| <-- total across all cities in 2016
| null|2017| 250| <-- total across all cities in 2017
| null|null| 550|
+-------+----+------+
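Likewise, this second query is equivalent to cube("city", "year") in the Dataset API, which adds the per-year (year) grouping set on top of rollup. A self-contained sketch (the local SparkSession setup is an illustrative assumption; in spark-shell it already exists):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cube-demo").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, 100), ("Warsaw", 2017, 200),
  ("Boston", 2015, 50), ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// equivalent to GROUP BY city, year GROUPING SETS ((city, year), (city), (year), ())
val byCube = sales.cube("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy($"city".desc_nulls_last, $"year".asc_nulls_last)
byCube.show()
```

So while an arbitrary GROUPING SETS clause has no Dataset API counterpart, the two common special cases (rollup and cube) do.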
Answer 1 (score: 0)