
Question

I would like to know the best way to do an aggregation in Spark in the following scenario:

import java.sql.Timestamp
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

case class Person(name: String, acc: Int, logDate: String)
val dateFormat = "dd/MM/yyyy"
val filterType: String = "MIN" // or "MAX"; set from a run parameter
val filterDate = new Timestamp(System.currentTimeMillis)

val df = sc.parallelize(List(Person("Giorgio",20,"31/12/9999"),
                             Person("Giorgio",30,"12/10/2009"),
                             Person("Diego",  10,"12/10/2010"),
                             Person("Diego",  20,"12/10/2010"),
                             Person("Diego",  30,"22/11/2011"),
                             Person("Giorgio",10,"31/12/9999"),
                             Person("Giorgio",30,"31/12/9999"))).toDF()

// Parse the logDate strings into real timestamps so they can be compared.
val df2 = df.withColumn("logDate", unix_timestamp($"logDate", dateFormat).cast(TimestampType))

val df3 = df2.groupBy("name").agg(/* conditional aggregation */)
df3.show /* expected output shown below */

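Basically, I want to group all records by the name column and then, based on the filterType parameter, keep only the valid records for each person. After filtering, I want to sum all acc values to obtain a final DataFrame with name and totalAcc columns.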

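For example:

filterType = MIN: I want to take all records with min(logDate), of which there may be several. In this case I ignore the filterDate parameter entirely:

Diego,10,12/10/2010
Diego,20,12/10/2010
Giorgio,30,12/10/2009

The final result after aggregation is: (Diego, 30), (Giorgio, 30)

filterType = MAX: I want to take all records with logDate > filterDate. If a key has no records satisfying this condition, I need to take the records with min(logDate), as in the MIN scenario, so:

Diego,10,12/10/2010
Diego,20,12/10/2010
Giorgio,20,31/12/9999
Giorgio,10,31/12/9999
Giorgio,30,31/12/9999

The final result after aggregation is: (Diego, 30), (Giorgio, 60). In this case Diego has no records with logDate > filterDate, so I fall back to the MIN scenario and take, for Diego only, all records with the minimum logDate.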
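The two rules can be pinned down with a small plain-Scala sketch that does not use Spark at all, which may help when checking a DataFrame solution against the expected totals. The object name `ExpectedTotals`, the helper `totals`, and the use of `java.time.LocalDate` at day granularity (instead of the `Timestamp` used in the Spark code) are assumptions made for illustration only.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object ExpectedTotals {
  final case class Person(name: String, acc: Int, logDate: String)

  private val fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy")

  // Compare dates as epoch days so no Ordering[LocalDate] is needed.
  private def day(p: Person): Long = LocalDate.parse(p.logDate, fmt).toEpochDay

  // MIN rule: sum acc over the rows that share the earliest logDate.
  private def sumAtMinDate(rows: Seq[Person]): Int = {
    val minDay = rows.map(day).min
    rows.filter(day(_) == minDay).map(_.acc).sum
  }

  // Hypothetical helper (not from the question): total acc per name.
  def totals(people: Seq[Person], filterType: String, filterDate: LocalDate): Map[String, Int] =
    people.groupBy(_.name).map { case (name, rows) =>
      val total = filterType match {
        case "MIN" => sumAtMinDate(rows)
        case "MAX" =>
          // MAX rule: rows after filterDate, falling back to the MIN rule per name.
          val later = rows.filter(day(_) > filterDate.toEpochDay)
          if (later.nonEmpty) later.map(_.acc).sum else sumAtMinDate(rows)
        case other => throw new IllegalArgumentException(s"unknown filterType: $other")
      }
      name -> total
    }

  // The sample rows from the question.
  val sample: Seq[Person] = Seq(
    Person("Giorgio", 20, "31/12/9999"),
    Person("Giorgio", 30, "12/10/2009"),
    Person("Diego", 10, "12/10/2010"),
    Person("Diego", 20, "12/10/2010"),
    Person("Diego", 30, "22/11/2011"),
    Person("Giorgio", 10, "31/12/9999"),
    Person("Giorgio", 30, "31/12/9999")
  )
}
```

Calling `ExpectedTotals.totals(ExpectedTotals.sample, "MAX", LocalDate.now())` applies the fallback to Diego only, since all of Diego's records lie in the past.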

Answer 1

You can write the conditional aggregation using when/otherwise as

df2.groupBy("name").agg(sum(when(lit(filterType) === "MIN" && $"logDate" < filterDate, $"acc").otherwise(when(lit(filterType) === "MAX" && $"logDate" > filterDate, $"acc"))).as("sum"))
    .filter($"sum".isNotNull)

which would give you the desired output based on filterType.

But

Eventually you will need both aggregated DataFrames, so I would suggest you avoid the filterType field and instead aggregate by creating an additional column for grouping with the when/otherwise function. That way you get both aggregated values in one DataFrame:

df2.withColumn("additionalGrouping", when($"logDate" < filterDate, "less").otherwise("more"))
    .groupBy("name", "additionalGrouping").agg(sum($"acc"))
    .drop("additionalGrouping")
    .show(false)

which would output

+-------+--------+
|name   |sum(acc)|
+-------+--------+
|Diego  |10      |
|Giorgio|60      |
+-------+--------+

Updated

Since the question was updated with changed logic, here is the idea and solution for the changed scenario:

import org.apache.spark.sql.expressions._

def windowSpec = Window.partitionBy("name").orderBy($"logDate".asc)

// Per name, keep only the rows at the earliest logDate and sum their acc.
val minDF = df2.withColumn("minLogDate", first("logDate").over(windowSpec)).filter($"minLogDate" === $"logDate")
    .groupBy("name")
    .agg(sum($"acc").as("sum"))

val finalDF =
    if(filterType == "MIN") {
        minDF
    }
    else if(filterType == "MAX"){
        // sum is null for names that have no row with logDate > filterDate
        val tempMaxDF = df2
            .groupBy("name")
            .agg(sum(when($"logDate" > filterDate,$"acc")).as("sum"))

        // fall back to minDF for those names, keep the MAX sums for the rest
        tempMaxDF.filter($"sum".isNull).drop("sum").join(minDF, Seq("name"), "left").union(tempMaxDF.filter($"sum".isNotNull))
    }
    else {
        df2
    }

So for filterType = MIN, you should have

+-------+---+
|name   |sum|
+-------+---+
|Diego  |30 |
|Giorgio|30 |
+-------+---+

For filterType = MAX, you should have

+-------+---+
|name   |sum|
+-------+---+
|Diego  |30 |
|Giorgio|60 |
+-------+---+

If filterType is neither MAX nor MIN, the original DataFrame is returned.

I hope the answer is helpful.

Answer 2

You don't need conditional aggregation. Just filter.

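One possible reading of "just filter" is to pick the row predicate from filterType first and then run an ordinary groupBy/sum. The sketch below is an illustration under that assumption, not the answerer's own code: the object name `FilterThenAggregate`, the helper `totals`, and the `totalAcc` alias are hypothetical, and it uses the simple `logDate < filterDate` / `logDate > filterDate` conditions, so it does not cover the min(logDate) fallback described in the updated question.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, sum, unix_timestamp}
import org.apache.spark.sql.types.TimestampType

object FilterThenAggregate {
  // Hypothetical helper: keep the rows selected by filterType, then sum acc per name.
  def totals(df: DataFrame, filterType: String, filterDate: Timestamp): DataFrame = {
    val keep: Column = filterType match {
      case "MIN" => col("logDate") < lit(filterDate)
      case "MAX" => col("logDate") > lit(filterDate)
      case other => throw new IllegalArgumentException(s"unknown filterType: $other")
    }
    df.where(keep).groupBy("name").agg(sum(col("acc")).as("totalAcc"))
  }

  // The question's sample rows with logDate parsed into a timestamp column.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("Giorgio", 20, "31/12/9999"),
      ("Giorgio", 30, "12/10/2009"),
      ("Diego", 10, "12/10/2010"),
      ("Diego", 20, "12/10/2010"),
      ("Diego", 30, "22/11/2011"),
      ("Giorgio", 10, "31/12/9999"),
      ("Giorgio", 30, "31/12/9999")
    ).toDF("name", "acc", "logDate")
      .withColumn("logDate", unix_timestamp($"logDate", "dd/MM/yyyy").cast(TimestampType))
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[*]").appName("filter-then-aggregate").getOrCreate()
    totals(sampleData(spark), "MAX", new Timestamp(System.currentTimeMillis)).show(false)
    spark.stop()
  }
}
```

Because rows are dropped before grouping, a name with no matching rows simply does not appear in the result, which is why the fallback case needs the join-based approach instead.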

Conditional aggregation on a Spark DataFrame
