I have the aggregation functions, aliases and other fields as a list in a JSON configuration, for example:
{
"aggregation": [{
"alias_column_name1": {
"sum": "<columnName1>"
}
}, {
"alias_column_name2": {
"sum": "<columnName1>"
}
}]
}
Currently I execute this with the following code:
val col1: Column = sum(<dataframeName>(<columnName1>)).alias(<alias_column_name1>)
val col2: Column = sum(<dataframeName>(<columnName2>)).alias(<alias_column_name2>)
dataframe.groupBy(..).agg(col1, col2)
But I have many such aggregation configurations, and I would like to pass a list of them to the agg method, for example:
val colList = List[Column](col1, col2)
dataframe.groupBy(..).agg(colList)
How can I achieve this? Thanks.
Versions:
Scala: 2.11
Spark: 2.2.2
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.2"
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.2.2"
Answer 0 (score: 1)
Separate list of columns and functions
Let's say you have a list of functions:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for $"..." and toDF; assumes a SparkSession named spark

val funs: Seq[Column => Column] = Seq(sum _, min _, max _)
and a list of columns:
val cols: Seq[Column] = Seq($"y", $"z")
and a dataset:
val df = Seq((1, 2, 3), (1, 4, 5)).toDF("x", "y", "z")
you can combine the two like this:
val exprs = for { c <- cols; f <- funs } yield f(c)
and then use it like this (agg takes a first Column followed by varargs, hence the head / tail split):
df.groupBy($"x").agg(exprs.head, exprs.tail: _*)
The same thing can be done in PySpark:
from pyspark.sql import functions as F
funs = [F.sum, F.min, F.max]
cols = ["y", "z"]
df = spark.createDataFrame([(1, 2, 3), (1, 4, 5)], ("x", "y", "z"))
df.groupBy("x").agg(*[f(c) for c in cols for f in funs])
Predefined list of operations for each column
If you want to start with a predefined set of aliases, columns and functions, as shown in your question, it might be easier to restructure it as:
trait AggregationOp {
  def expr: Column
}

case class FuncAggregationOp(c: Column, func: Column => Column, alias: String)
    extends AggregationOp {
  def expr = func(c).alias(alias)
}
val ops: Seq[AggregationOp] = Seq(
  FuncAggregationOp($"y", sum _, "alias_column_name1"),
  FuncAggregationOp($"z", sum _, "alias_column_name2")
)
val exprs = ops.map(_.expr)
df.groupBy($"x").agg(exprs.head, exprs.tail: _*)
This can be easily adjusted to handle other cases:
case class StringAggregationOp(c: String, func: String, alias: String)
    extends AggregationOp {
  def expr = org.apache.spark.sql.functions.expr(s"${func}(`${c}`)").alias(alias)
}
val ops: Seq[AggregationOp] = Seq(
  StringAggregationOp("y", "sum", "alias_column_name1"),
  StringAggregationOp("z", "sum", "alias_column_name2")
)
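As a minimal sketch of how such an ops sequence could be built from the JSON in your question (this assumes json4s, which Spark 2.2 already depends on, and that every aggregation entry holds exactly one alias mapping to exactly one function/column pair; configString is a hypothetical variable holding the raw JSON):

import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical: configString holds the JSON document from the question.
val opsFromJson: Seq[AggregationOp] =
  (parse(configString) \ "aggregation").children.flatMap {
    case JObject(fields) => fields.collect {
      // Each entry is assumed to look like {"alias": {"func": "column"}}.
      case (alias, JObject((func, JString(col)) :: Nil)) =>
        StringAggregationOp(col, func, alias)
    }
    case _ => Nil
  }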
A Python equivalent could be something like this:
from collections import namedtuple
from pyspark.sql import functions as F
class AggregationOp(namedtuple("Op", ["c", "func", "alias"])):
    def expr(self):
        if callable(self.func):
            return self.func(self.c).alias(self.alias)
        else:
            return F.expr("{func}(`{c}`)".format(
                func=self.func, c=self.c)).alias(self.alias)

ops = [
    AggregationOp("y", "sum", "alias_column_name1"),
    AggregationOp("z", "sum", "alias_column_name2")
]

df.groupBy("x").agg(*[op.expr() for op in ops])