Spark/Scala: repeated calls to withColumn() using the same function on multiple columns

Asked: 2016-12-30 17:53:32

Tags: scala apache-spark dataframe apache-spark-sql user-defined-functions

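I currently have code in which I repeatedly apply the same procedure to multiple DataFrame columns via a chain of .withColumn calls, and I would like to write a function that streamlines the process. In my case, I am computing cumulative sums over columns, aggregated by key: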

val newDF = oldDF
  .withColumn("cumA", sum("A").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumB", sum("B").over(Window.partitionBy("ID").orderBy("time")))
  .withColumn("cumC", sum("C").over(Window.partitionBy("ID").orderBy("time")))
  //.withColumn(...)

What I would like is:

def createCumulativeColums(cols: Array[String], df: DataFrame): DataFrame = {
  // Implement the above cumulative sums, partitioning, and ordering
}

Or, better yet:

def withColumns(cols: Array[String], df: DataFrame, f: function): DataFrame = {
  // Implement a udf/arbitrary function on all the specified columns
}

3 Answers:

Answer 0 (score: 26)

You can use select with varargs, including *:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
import spark.implicits._

df.select($"*" +: Seq("A", "B", "C").map(c => 
  sum(c).over(Window.partitionBy("ID").orderBy("time")).alias(s"cum$c")
): _*)

This:

  • Maps column names to window expressions with Seq("A", ...).map(...)
  • Prepends all pre-existing columns with $"*" +: ...
  • Unpacks the combined sequence with ... : _*
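
The prepend and unpack steps can be seen in isolation with plain Scala collections. The following is only a minimal sketch with no Spark dependency: `describe` is a hypothetical varargs function standing in for `select(cols: Column*)`, and strings stand in for Column expressions.

```scala
// Hypothetical stand-in for a varargs method such as select(cols: Column*).
def describe(cols: String*): String = cols.mkString(", ")

val existing = "*"                                    // stands in for $"*"
val derived  = Seq("A", "B", "C").map(c => s"cum$c")  // one new entry per input name

// `+:` prepends a single element to a sequence.
val combined = existing +: derived

// `: _*` expands the sequence into individual varargs arguments.
val result = describe(combined: _*)
println(result)
```

Without `: _*` the call would not compile, because `describe` expects individual String arguments rather than a single Seq[String].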

The select approach can be generalized as:

import org.apache.spark.sql.{Column, DataFrame}

/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 */
def withColumns(cols: Seq[String], df: DataFrame, f: String => Column) =
  df.select($"*" +: cols.map(c => f(c)): _*)
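
As a rough usage sketch, the select-based helper could be applied to the cumulative sums from the question as shown below. The sample data, the local SparkSession settings, and the value names `w` and `result` are assumptions made for illustration. `col("*")` is swapped in for `$"*"` so that the helper does not rely on `spark.implicits._` being in scope where it is defined.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("withColumns-select").getOrCreate()
import spark.implicits._

// Same shape as the helper above, with col("*") in place of $"*".
def withColumns(cols: Seq[String], df: DataFrame, f: String => Column): DataFrame =
  df.select(col("*") +: cols.map(c => f(c)): _*)

// Hypothetical sample data using the column names from the question.
val df = Seq(
  (1, 1, 10, 100, 1000),
  (1, 2, 20, 200, 2000),
  (2, 1, 30, 300, 3000)
).toDF("ID", "time", "A", "B", "C")

val w = Window.partitionBy("ID").orderBy("time")

// f supplies both the expression and, via alias, the output column name.
val result = withColumns(Seq("A", "B", "C"), df, c => sum(c).over(w).alias(s"cum$c"))
result.show()
```

Because this version has no separate naming parameter, the alias inside `f` is what keeps the new columns from getting auto-generated names.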

If you find the withColumn syntax more readable, you can use foldLeft:

Seq("A", "B", "C").foldLeft(df)((df, c) =>
  df.withColumn(s"cum$c",  sum(c).over(Window.partitionBy("ID").orderBy("time")))
)

which can be generalized, for example, to:

/**
 * @param cols a sequence of columns to transform
 * @param df an input DataFrame
 * @param f a function to be applied on each col in cols
 * @param name a function mapping from input to output name.
 */
def withColumns(cols: Seq[String], df: DataFrame, 
    f: String =>  Column, name: String => String = identity) =
  cols.foldLeft(df)((df, c) => df.withColumn(name(c), f(c)))
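
The effect of the `name` parameter is easiest to see with a small example. The sketch below assumes hypothetical sample data and a local SparkSession; `cumSum`, `appended`, and `replaced` are illustrative names. It relies on the fact that withColumn replaces an existing column when the given name is already present.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("withColumns-foldLeft").getOrCreate()
import spark.implicits._

// Same helper as above; the accumulator is named acc to avoid shadowing df.
def withColumns(cols: Seq[String], df: DataFrame,
    f: String => Column, name: String => String = identity): DataFrame =
  cols.foldLeft(df)((acc, c) => acc.withColumn(name(c), f(c)))

// Hypothetical sample data.
val df = Seq((1, 1, 10, 100), (1, 2, 20, 200), (2, 1, 30, 300)).toDF("ID", "time", "A", "B")
val w = Window.partitionBy("ID").orderBy("time")
val cumSum: String => Column = c => sum(c).over(w)

// With an explicit naming function, new columns are appended.
val appended = withColumns(Seq("A", "B"), df, cumSum, c => s"cum$c")

// With the default identity, the output name equals the input name,
// so A and B are replaced by their cumulative sums.
val replaced = withColumns(Seq("A", "B"), df, cumSum)

appended.show()
replaced.show()
```

Whether replacing in place is desirable depends on the use case; passing a naming function keeps the original columns available.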

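Answer 1 (score: 4)

This question is a bit old, but I thought it would be useful (perhaps for others) to point out that folding over the list of columns with the DataFrame as the accumulator and mapping over the list of columns inside a single select give substantially different performance when the number of columns is not trivial (a full explanation is linked from the original answer). Long story short: for a few columns foldLeft is fine; otherwise map is better.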
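
One way to observe the difference is to time how long it takes to build (not execute) each variant. The following is only a sketch: `timeMs` is a hypothetical helper, the column count `n` and the `x + 1` expression are arbitrary assumptions, and the measured numbers will vary with Spark version and environment. The underlying reason, as noted in the Spark documentation for withColumn, is that each call introduces a projection internally, so calling it in a loop can produce large plans.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("fold-vs-select").getOrCreate()
import spark.implicits._

// Hypothetical helper: wall-clock milliseconds spent evaluating body.
def timeMs[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val out = body
  (out, (System.nanoTime() - start) / 1000000)
}

val base = Seq((1, 2)).toDF("id", "x")
val n = 200  // assumed column count; increase it to widen the gap
val names = (1 to n).map(i => s"x_$i")

// One withColumn call per column: each step adds another projection to the plan.
val (folded, foldMs) = timeMs {
  names.foldLeft(base)((acc, name) => acc.withColumn(name, col("x") + 1))
}

// A single select: all new columns are added in one projection.
val (selected, selectMs) = timeMs {
  base.select(col("*") +: names.map(name => (col("x") + 1).alias(name)): _*)
}

println(s"foldLeft: $foldMs ms, select: $selectMs ms")
```

Both variants yield the same set of columns; only the way the plan is constructed differs.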

Answer 2 (score: 0)

In PySpark:

from pyspark.sql import Window
import pyspark.sql.functions as F

window = Window.partitionBy("ID").orderBy("time")
df.select(
    "*", # selects all existing columns
    *[
        F.sum(col).over(window).alias(col_name)
        for col, col_name in zip(["A", "B", "C"], ["cumA", "cumB", "cumC"])
    ]
)