Does Spark Scala support GROUPING SETS at the function level?
I don't know whether this patch made it into master: https://github.com/apache/spark/pull/5080
I would like to run this kind of query through the Scala DataFrame API:
GROUP BY expression list GROUPING SETS(expression list2)
The Dataset API provides cube and rollup functions, but I cannot find grouping sets. Why is that?
Answer 0 (score: 2)
I would like to run this kind of query through the Scala DataFrame API.

tl;dr As of Spark 2.1.0, this is not possible, and there are currently no plans to add such an operator to the Dataset API.

Spark SQL supports the following so-called multi-dimensional aggregate operators:

- rollup operator
- cube operator
- GROUPING SETS clause (SQL mode only)
- grouping() and grouping_id() functions

Note: GROUPING SETS is available only in SQL mode; the Dataset API does not support it.
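The grouping() and grouping_id() functions mentioned above can be used from the Dataset API to tell which rows of a cube or rollup result are subtotals. A minimal self-contained sketch (the local SparkSession setup and the tiny demo DataFrame are illustrative assumptions, not part of the original answer):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{grouping, grouping_id, sum}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("grouping-demo")
  .getOrCreate()
import spark.implicits._

val demo = Seq(("Warsaw", 2016, 100), ("Boston", 2015, 50))
  .toDF("city", "year", "amount")

// grouping_id() encodes which grouping columns were aggregated away:
// for cube("city", "year"): 0 = (city, year), 1 = (city), 2 = (year), 3 = ()
val q = demo.cube("city", "year")
  .agg(sum("amount") as "amount", grouping("city") as "g_city", grouping_id() as "gid")
q.show()
```

This makes it easy to filter subtotal rows (e.g. gid = 3 is the grand total) instead of testing columns for null.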
// spark-shell session: spark.implicits._ is in scope, so toDF is available on Seq
val sales = Seq(
  ("Warsaw", 2016, 100),
  ("Warsaw", 2017, 200),
  ("Boston", 2015, 50),
  ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")
sales.createOrReplaceTempView("sales")
// equivalent to rollup("city", "year")
val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|null| 550| <-- grand total across all cities and years
+-------+----+------+
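Since this grouping-sets query is equivalent to rollup("city", "year"), the same result can be produced without SQL through the Dataset API. A self-contained sketch (the local SparkSession setup is an illustrative assumption; in spark-shell the session and the sales DataFrame above already exist):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("rollup-demo").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, 100), ("Warsaw", 2017, 200),
  ("Boston", 2015, 50), ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// equivalent to GROUP BY city, year GROUPING SETS ((city, year), (city), ())
val byRollup = sales.rollup("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy($"city".desc_nulls_last, $"year".asc_nulls_last)
byRollup.show()
```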
// equivalent to cube("city", "year")
// note the additional (year) grouping set
val q = sql("""
SELECT city, year, sum(amount) as amount
FROM sales
GROUP BY city, year
GROUPING SETS ((city, year), (city), (year), ())
ORDER BY city DESC NULLS LAST, year ASC NULLS LAST
""")
scala> q.show
+-------+----+------+
| city|year|amount|
+-------+----+------+
| Warsaw|2016| 100|
| Warsaw|2017| 200|
| Warsaw|null| 300|
|Toronto|2017| 50|
|Toronto|null| 50|
| Boston|2015| 50|
| Boston|2016| 150|
| Boston|null| 200|
| null|2015| 50| <-- total across all cities in 2015
| null|2016| 250| <-- total across all cities in 2016
| null|2017| 250| <-- total across all cities in 2017
| null|null| 550|
+-------+----+------+
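Likewise, this second query is equivalent to cube("city", "year") in the Dataset API, which adds the per-year (year) grouping set on top of rollup. A self-contained sketch (the local SparkSession setup is an illustrative assumption; in spark-shell it already exists):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("cube-demo").getOrCreate()
import spark.implicits._

val sales = Seq(
  ("Warsaw", 2016, 100), ("Warsaw", 2017, 200),
  ("Boston", 2015, 50), ("Boston", 2016, 150),
  ("Toronto", 2017, 50)
).toDF("city", "year", "amount")

// equivalent to GROUP BY city, year GROUPING SETS ((city, year), (city), (year), ())
val byCube = sales.cube("city", "year")
  .agg(sum("amount") as "amount")
  .orderBy($"city".desc_nulls_last, $"year".asc_nulls_last)
byCube.show()
```

So while an arbitrary GROUPING SETS clause has no Dataset API counterpart, the two common special cases (rollup and cube) do.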
Answer 1 (score: 0)