How can I use GROUPING SETS as an operator/method on a Dataset?

Asked: 2016-12-02 02:15:01

Tags: apache-spark dataframe apache-spark-sql

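Does Spark's Scala API support grouping_sets at the function level?

I don't know whether this patch was ever applied to master: https://github.com/apache/spark/pull/5080

I would like to run the following kind of query through the Scala DataFrame API.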

GROUP BY expression list GROUPING SETS(expression list2)

The Dataset API provides the cube and rollup functions, but I cannot find grouping sets. Why is that?

2 Answers:

Answer 0: (score: 2)

  

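"I would like to run the following kind of query through the Scala DataFrame API."

tl;dr As of Spark 2.1.0 this is not possible, and there are currently no plans to add such an operator to the Dataset API.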

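Spark SQL supports the following so-called multi-dimensional aggregate operators:

  • the rollup operator
  • the cube operator
  • the GROUPING SETS clause (SQL mode only)
  • the grouping() and grouping_id() functions

Note: GROUPING SETS is available only in SQL mode. The Dataset API does not support it.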
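
One possible workaround, sketched here and not part of the original answer: cube computes every combination of the grouping columns, so an arbitrary list of grouping sets can be emulated in the DataFrame API by tagging each row with grouping_id() and keeping only the combinations you want. The `orders` data and the names `perCityOrYear` and `gid` are assumptions made for illustration. In spark-shell the session and implicits already exist, so the first few lines can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, grouping_id, sum}

// Local session for a standalone run (assumption: Spark 2.x or later on the classpath).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouping-sets-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data, used only for this illustration.
val orders = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2016, 150)
).toDF("city", "year", "amount")

// cube computes all four grouping sets: (city, year), (city), (year), ().
// grouping_id() equals (grouping(city) << 1) + grouping(year), so:
//   0 -> (city, year), 1 -> (city), 2 -> (year), 3 -> ()
// Keeping ids 1 and 2 emulates GROUPING SETS ((city), (year)).
val perCityOrYear = orders
  .cube("city", "year")
  .agg(sum("amount") as "amount", grouping_id() as "gid")
  .where(col("gid").isin(1, 2))
  .drop("gid")

perCityOrYear.show()
```

The trade-off is that cube still computes all 2^n combinations before the filter is applied, so with many grouping columns the SQL GROUPING SETS clause is likely the cheaper choice.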

GROUPING SETS

// Run in spark-shell, where `spark`, `sql` and the implicits needed by toDF are already in scope.
val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")

// equivalent to rollup("city", "year")
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|null|   550|  <-- grand total across all cities and years
+-------+----+------+

// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
  SELECT city, year, sum(amount) as amount
  FROM sales
  GROUP BY city, year
  GROUPING SETS ((city, year), (city), (year), ())
  ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
  """)
scala> q.show
+-------+----+------+
|   city|year|amount|
+-------+----+------+
| Warsaw|2016|   100|
| Warsaw|2017|   200|
| Warsaw|null|   300|
|Toronto|2017|    50|
|Toronto|null|    50|
| Boston|2015|    50|
| Boston|2016|   150|
| Boston|null|   200|
|   null|2015|    50|  <-- total across all cities in 2015
|   null|2016|   250|  <-- total across all cities in 2016
|   null|2017|   250|  <-- total across all cities in 2017
|   null|null|   550|
+-------+----+------+
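
For comparison, the two SQL queries above can also be written with the rollup and cube operators that the DataFrame API does provide. The following is a sketch that rebuilds the same sales data so it can run on its own. The names `viaRollup`, `viaCube` and `gid` are illustrative, and `desc_nulls_last` / `asc_nulls_last` assume Spark 2.1 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, grouping_id, sum}

// Local session for a standalone run; spark-shell already provides one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rollup-cube-sketch")
  .getOrCreate()
import spark.implicits._

// Same data as in the SQL examples.
val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// Counterpart of GROUPING SETS ((city, year), (city), ())
val viaRollup = sales
  .rollup("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy(col("city").desc_nulls_last, col("year").asc_nulls_last)

// Counterpart of GROUPING SETS ((city, year), (city), (year), ()).
// The extra gid column tells subtotal rows apart from genuine nulls in the data:
//   0 -> (city, year), 1 -> (city), 2 -> (year), 3 -> ()
val viaCube = sales
  .cube("city", "year")
  .agg(sum("amount") as "amount", grouping_id() as "gid")
  .orderBy(col("city").desc_nulls_last, col("year").asc_nulls_last)

viaRollup.show()
viaCube.show()
```

Any list of grouping sets other than these two shapes still has to go through SQL or be derived from the cube result.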

Answer 1: (score: 0)