GroupBy and aggregate with conditions in Spark / Scala

Asked: 2019-10-02 20:03:20

Tags: scala apache-spark aggregate

I have a DataFrame like this:

+--------------------+-------------------+------------------+------------------------+---+
|   ID_VISITE_CALCULE|       TAG_TS_TO_TS|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| RK|
+--------------------+-------------------+------------------+------------------------+---+
|GA1.2.1023040287....|2019-04-23 11:24:19|            dupont|                    null|  1|
|GA1.2.1023040287....|2019-04-23 11:24:19|            durand|                    null|  2|
|GA1.2.105243141.1...|2019-04-23 11:21:01|              null|                    null|  1|
|GA1.2.1061963529....|2019-04-23 11:12:19|              null|                    null|  1|
|GA1.2.1065635192....|2019-04-23 11:07:14|            antoni|                    null|  1|
|GA1.2.1074357108....|2019-04-23 11:11:34|              lang|                    null|  1|
|GA1.2.1074357108....|2019-04-23 11:12:37|              lang|                    null|  2|
|GA1.2.1075803022....|2019-04-23 11:28:38|            cavail|                    null|  1|
|GA1.2.1080137035....|2019-04-23 11:20:00|              null|                    null|  1|
|GA1.2.1081805479....|2019-04-23 11:10:49|              null|                    null|  1|
|GA1.2.1081805479....|2019-04-23 11:10:49|            linare|                    null|  2|
|GA1.2.1111218536....|2019-04-23 11:28:43|              null|                    null|  1|
|GA1.2.1111218536....|2019-04-23 11:32:26|              null|                    null|  2|
|GA1.2.1111570355....|2019-04-23 11:07:00|              null|                    null|  1|
+--------------------+-------------------+------------------+------------------------+---+

I am trying to apply rules to aggregate by ID_VISITE_CALCULE and keep only one row per ID.

For each ID (each group), I want to:

  • Take the group's first timestamp and store it in a START column

  • Take the group's last timestamp and store it in an END column

  • Test whether EXTERNAL_PERSON_ID is the same across the whole group. If it is and the value is NULL, I write NULL; if it is and the value is a name, I write that name. Finally, if the values differ within the group, I write UNDEFINED

  • Apply exactly the same rules to the EXTERNAL_ORGANISATION_ID column

RESULT:
+--------------------+------------------+------------------------+-------------------+-------------------+
|   ID_VISITE_CALCULE|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID|              START|                END|
+--------------------+------------------+------------------------+-------------------+-------------------+
|GA1.2.1023040287....|         undefined|                    null|2019-04-23 11:24:19|2019-04-23 11:24:19|
|GA1.2.105243141.1...|              null|                    null|2019-04-23 11:21:01|2019-04-23 11:21:01|
|GA1.2.1061963529....|              null|                    null|2019-04-23 11:12:19|2019-04-23 11:12:19|
|GA1.2.1065635192....|            antoni|                    null|2019-04-23 11:07:14|2019-04-23 11:07:14|
|GA1.2.1074357108....|              lang|                    null|2019-04-23 11:11:34|2019-04-23 11:12:37|
|GA1.2.1075803022....|            cavail|                    null|2019-04-23 11:28:38|2019-04-23 11:28:38|
|GA1.2.1080137035....|              null|                    null|2019-04-23 11:20:00|2019-04-23 11:20:00|
|GA1.2.1081805479....|         undefined|                    null|2019-04-23 11:10:49|2019-04-23 11:10:49|
|GA1.2.1111218536....|              null|                    null|2019-04-23 11:28:43|2019-04-23 11:32:26|
|GA1.2.1111570355....|              null|                    null|2019-04-23 11:07:00|2019-04-23 11:07:00|
+--------------------+------------------+------------------------+-------------------+-------------------+

In my example a group has at most 2 rows, but in the real dataset a group can contain hundreds of rows.

Thank you for your help.

2 answers:

Answer 0 (score: 0)

It would be nice if you showed some code / sample data for how you build the DataFrame.

Assuming your DataFrame is tableDf.
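For a runnable test, here is one hypothetical way to rebuild it from the sample rows above (the IDs are truncated in the question, so shortened placeholders are used, and the RK column is omitted since the aggregation does not need it):

import spark.implicits._  // assumes an active SparkSession named `spark`

val tableDf = Seq(
  ("GA1.2.1023040287", "2019-04-23 11:24:19", Some("dupont"), Option.empty[String]),
  ("GA1.2.1023040287", "2019-04-23 11:24:19", Some("durand"), Option.empty[String]),
  ("GA1.2.1074357108", "2019-04-23 11:11:34", Some("lang"),   Option.empty[String]),
  ("GA1.2.1074357108", "2019-04-23 11:12:37", Some("lang"),   Option.empty[String]),
  ("GA1.2.1081805479", "2019-04-23 11:10:49", Option.empty[String], Option.empty[String]),
  ("GA1.2.1081805479", "2019-04-23 11:10:49", Some("linare"), Option.empty[String])
).toDF("ID_VISITE_CALCULE", "TAG_TS_TO_TS", "EXTERNAL_PERSON_ID", "EXTERNAL_ORGANISATION_ID")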

**Spark SQL solution**

tableDf.createOrReplaceTempView("input_table")

val sqlStr = """
select ID_VISITE_CALCULE,
       (case when count(distinct person_id_calculation) > 1 then "undefined"
             when count(distinct person_id_calculation) = 1
                  and max(person_id_calculation) = "noNull" then null
             else max(person_id_calculation)
        end) as EXTERNAL_PERSON_ID,
       -- do the same for EXTERNAL_ORGANISATION_ID
       max(start_v) as start_v,
       max(last_v) as last_v
from
  (select ID_VISITE_CALCULE,
          -- nulls become a sentinel so that count(distinct) can see them;
          -- a group with several distinct values turns into "undefined" above
          nvl(EXTERNAL_PERSON_ID, "noNull") as person_id_calculation,
          -- same calculation for EXTERNAL_ORGANISATION_ID
          first(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE
                                   order by TAG_TS_TO_TS) as start_v,
          -- last() over an order-by frame only sees rows up to the current one,
          -- so the outer max(last_v) is what yields the group's final timestamp
          last(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE
                                  order by TAG_TS_TO_TS) as last_v
   from input_table) a
group by 1
"""
val resultDf = spark.sql(sqlStr)
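A quick sanity check of the query (assuming the tableDf sketched at the top of this answer):

resultDf.orderBy("ID_VISITE_CALCULE").show(truncate = false)

Passing truncate = false keeps the full ID_VISITE_CALCULE values visible instead of show's 20-character default.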

Answer 1 (score: 0)

Everything could be done in a single groupBy call, but for (slightly) better performance, and for readability, I suggest splitting the code into two calls:

import org.apache.spark.sql.functions.{coalesce, col, collect_set, lit, max, min, size, when}

// collect_set drops nulls, so they are mapped to a sentinel first; otherwise a
// group mixing null and a name would collapse to a single element and escape
// the UNDEFINED rule.
val res1DF = df.groupBy(col("ID_VISITE_CALCULE")).agg(
  min(col("TAG_TS_TO_TS")).alias("START"),
  max(col("TAG_TS_TO_TS")).alias("END"),
  collect_set(coalesce(col("EXTERNAL_PERSON_ID"), lit("noNull"))).alias("EXTERNAL_PERSON_ID"),
  collect_set(coalesce(col("EXTERNAL_ORGANISATION_ID"), lit("noNull"))).alias("EXTERNAL_ORGANISATION_ID")
)

val res2DF = res1DF.withColumn("EXTERNAL_PERSON_ID",
  when(size(col("EXTERNAL_PERSON_ID")) > 1, lit("UNDEFINED"))
    .when(col("EXTERNAL_PERSON_ID").getItem(0) === "noNull", lit(null).cast("string"))
    .otherwise(col("EXTERNAL_PERSON_ID").getItem(0))
).withColumn("EXTERNAL_ORGANISATION_ID",
  when(size(col("EXTERNAL_ORGANISATION_ID")) > 1, lit("UNDEFINED"))
    .when(col("EXTERNAL_ORGANISATION_ID").getItem(0) === "noNull", lit(null).cast("string"))
    .otherwise(col("EXTERNAL_ORGANISATION_ID").getItem(0))
)

The getItem method does most of the conditional work here: when the set holds a single element, getItem(0) returns it. Mapping nulls to the "noNull" sentinel before collect_set is what makes this reliable: a group whose values were all null comes back out as null, and a group mixing null with a name is seen as two distinct values and therefore marked UNDEFINED.
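The need for the sentinel can be seen in a throwaway snippet (again assuming an active SparkSession named spark):

import spark.implicits._
import org.apache.spark.sql.functions.collect_set

// collect_set silently drops nulls: this prints [linare], not [null, linare],
// so without the sentinel a {null, "linare"} group would look like a single value.
Seq(("r1", Option("linare")), ("r2", Option.empty[String]))
  .toDF("id", "v")
  .agg(collect_set($"v").alias("s"))
  .show(false)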