How to aggregate over two columns in Spark SQL

Posted: 2017-07-13 18:49:37

Tags: scala apache-spark apache-spark-sql

I have a table and need to do the following:

  1. Group by both DepartmentID and EmployeeID
  2. Within each group, order the rows by (ArrivalDate, ArrivalTime) and pick the first one. That is, if the two dates differ, pick the newer date; if the dates are the same, pick the newer time.
  3. I am trying this approach:

    input.select("DepartmenId","EmolyeeID", "ArrivalDate", "ArrivalTime", "Word")
      .agg(here will be the function that handles logic from 2)
      .show()
    

    What is the syntax for the aggregation here?

    Thanks in advance.

    
    
    // +-----------+---------+-----------+-----------+--------+
    // |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime|   Word |
    // +-----------+---------+-----------+-----------+--------+
    // |     D1    |   E1    |  20170101 |    0730   |  "YES" |
    // +-----------+---------+-----------+-----------+--------+
    // |     D1    |   E1    |  20170102 |    1530   |  "NO"  |
    // +-----------+---------+-----------+-----------+--------+
    // |     D1    |   E2    |  20170101 |    0730   |  "ZOO" |
    // +-----------+---------+-----------+-----------+--------+
    // |     D1    |   E2    |  20170102 |    0330   |  "BOO" |
    // +-----------+---------+-----------+-----------+--------+
    // |     D2    |   E1    |  20170101 |    0730   |  "LOL" |
    // +-----------+---------+-----------+-----------+--------+
    // |     D2    |   E1    |  20170101 |    1830   |  "ATT" |
    // +-----------+---------+-----------+-----------+--------+
    // |     D2    |   E2    |  20170105 |    1430   |  "UNI" |
    // +-----------+---------+-----------+-----------+--------+
    
    
    // output should be
    
    // +-----------+---------+-----------+-----------+--------+
    // |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime|   Word |
    // +-----------+---------+-----------+-----------+--------+
    // |     D1    |   E1    |  20170102 |    1530   |  "NO"  |
    // +-----------+---------+-----------+-----------+--------+
    // |     D1    |   E2    |  20170102 |    0330   |  "BOO" |
    // +-----------+---------+-----------+-----------+--------+
    // |     D2    |   E1    |  20170101 |    1830   |  "ATT" |
    // +-----------+---------+-----------+-----------+--------+
    // |     D2    |   E2    |  20170105 |    1430   |  "UNI" |
    // +-----------+---------+-----------+-----------+--------+
    
    
    

2 Answers:

Answer 0 (score: 2)

One approach is to use a Spark window function:

import spark.implicits._ // needed for .toDF on local Seqs and for the $"col" syntax

val df = Seq(
  ("D1", "E1", "20170101", "0730", "YES"),
  ("D1", "E1", "20170102", "1530", "NO"),
  ("D1", "E2", "20170101", "0730", "ZOO"),
  ("D1", "E2", "20170102", "0330", "BOO"),
  ("D2", "E1", "20170101", "0730", "LOL"),
  ("D2", "E1", "20170101", "1830", "ATT"),
  ("D2", "E2", "20170105", "1430", "UNI")
).toDF(
  "DepartmenId", "EmolyeeID", "ArrivalDate", "ArrivalTime", "Word"
)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank the rows inside each (DepartmenId, EmolyeeID) group, newest arrival first,
// then keep only the top-ranked row of each group.
val df2 = df.withColumn("rowNum", row_number().over(
    Window.partitionBy("DepartmenId", "EmolyeeID").
      orderBy($"ArrivalDate".desc, $"ArrivalTime".desc)
  )).
  where($"rowNum" === 1).
  select("DepartmenId", "EmolyeeID", "ArrivalDate", "ArrivalTime", "Word").
  orderBy("DepartmenId", "EmolyeeID")

df2.show
+-----------+---------+-----------+-----------+----+
|DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime|Word|
+-----------+---------+-----------+-----------+----+
|         D1|       E1|   20170102|       1530|  NO|
|         D1|       E2|   20170102|       0330| BOO|
|         D2|       E1|   20170101|       1830| ATT|
|         D2|       E2|   20170105|       1430| UNI|
+-----------+---------+-----------+-----------+----+
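
For reference, the same window logic can also be expressed as a plain Spark SQL query. The following is only a sketch: it assumes an active SparkSession named spark and the df built above, and the temporary view name "arrivals" is an illustrative choice, not something from the original answer.

// Register the sample DataFrame as a temp view, then run the same
// row_number window query through SQL.
df.createOrReplaceTempView("arrivals")

spark.sql("""
  SELECT DepartmenId, EmolyeeID, ArrivalDate, ArrivalTime, Word
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (
             PARTITION BY DepartmenId, EmolyeeID
             ORDER BY ArrivalDate DESC, ArrivalTime DESC
           ) AS rowNum
    FROM arrivals
  ) t
  WHERE rowNum = 1
  ORDER BY DepartmenId, EmolyeeID
""").show()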

Answer 1 (score: 1)

You can take the max over a new struct column that contains all the non-grouping columns, with ArrivalDate first and ArrivalTime second. The struct's ordering matches your requirement (the later date wins; for equal dates, the later time wins), so taking its maximum yields the record you are after.

Then you can use a select to "split" the struct back into separate columns.

import spark.implicits._
import org.apache.spark.sql.functions._

df.groupBy($"DepartmentID", $"EmployeeID")
  .agg(max(struct("ArrivalDate", "ArrivalTime", "Word")) as "struct")
  .select($"DepartmentID", $"EmployeeID",
    $"struct.ArrivalDate" as "ArrivalDate",
    $"struct.ArrivalTime" as "ArrivalTime",
    $"struct.Word" as "Word"
  )
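
A caveat that applies to both approaches: ArrivalDate and ArrivalTime are plain strings in the sample data, so the window's orderBy and max(struct(...)) both compare them lexicographically. That coincides with chronological order here only because the dates use the fixed-width yyyyMMdd format and the times are zero-padded HHmm; with other formats (e.g. d/M/yyyy), the columns would need to be cast to date/timestamp types before ordering or taking the max.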