Spark: how to filter rows based on a value after grouping

Time: 2020-04-21 18:56:50

Tags: apache-spark apache-spark-sql

Suppose I have the following DataFrame:

// Keep the DataFrame in `a`; chaining .show onto the definition would store Unit instead.
val a = Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
a.show
+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  b|  1|
|  aa|  c|  5|
|  aa|  d|  0|
|  xx|  y|  5|
|   z| zz|  9|
|   z|  b| 12|
+----+---+---+

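I want to filter this DataFrame so that:

For each group of rows (grouped by name), if any row in the group has tag == 'b', I keep the row with the maximum value of num; otherwise I ignore the whole group.

This is the output I want: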

+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  c|  5|
|   z|  b| 12|
+----+---+---+

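Explanation

  • The group with name == 'aa' contains a row where tag == 'b', so I take the row with the maximum num in that group, which is 5.
  • The group with name == 'xx' contains no row where tag == 'b', so it is ignored.
  • The group with name == 'z' contains a row where tag == 'b', so I take the row with the maximum num in that group, which is 12.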
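
To make the rule above concrete, here is a minimal sketch on plain Scala collections, without Spark. The case class `Rec` and the values `rows` and `expected` are hypothetical names introduced only for this illustration.

```scala
// Hypothetical stand-in for one DataFrame row; not part of the Spark API.
final case class Rec(name: String, tag: String, num: Int)

val rows = Seq(
  Rec("aa", "b", 1), Rec("aa", "c", 5), Rec("aa", "d", 0),
  Rec("xx", "y", 5), Rec("z", "zz", 9), Rec("z", "b", 12)
)

// Keep only the groups that contain at least one row tagged "b",
// then take the row with the largest num from each surviving group.
val expected = rows
  .groupBy(_.name)
  .values
  .filter(_.exists(_.tag == "b"))
  .map(_.maxBy(_.num))
  .toSeq
  .sortBy(_.name)
```

Note that `maxBy` keeps a single row per group, so if two rows tie for the maximum num only one of them is kept; whether ties should all be kept is not stated in the question.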

1 Answer:

Answer 0: (score: 1)

Try this:

val df=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
df.createOrReplaceTempView("tab")

val res = spark.sql(""" with tw as (select t1.name, max(t1.num) as max_val
                          from tab t1 
                         where t1.name in (select distinct t2.name 
                                             from tab t2
                                            where t2.tag = 'b'
                                          )
                      group by t1.name )
                      select distinct tz.name, tz.tag, tz.num
                        from tab tz, tw
                       where tz.name = tw.name
                         and tz.num  = tw.max_val
                   """) 
res.show(false)
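
As a possible alternative to the SQL above, the same filter can be sketched with the DataFrame API and window functions. This is not part of the original answer; the column names `has_b` and `max_num` and the value `res2` are hypothetical helpers, and the local `SparkSession` setup is an assumption for running the block outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

// Assumption: standalone local session. In spark-shell, `spark` and its
// implicits already exist, so these two lines can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("filter-after-group").getOrCreate()
import spark.implicits._

val df = Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12))
  .toDF("name", "tag", "num")

// One window per name, so each row can see aggregates over its own group.
val byName = Window.partitionBy("name")

val res2 = df
  // 1 if any row in the group has tag 'b', otherwise 0 (hypothetical helper column).
  .withColumn("has_b", max(when(col("tag") === "b", 1).otherwise(0)).over(byName))
  // Largest num within the group (hypothetical helper column).
  .withColumn("max_num", max(col("num")).over(byName))
  // Keep only rows holding the group maximum, in groups that contain a 'b' tag.
  .where(col("has_b") === 1 && col("num") === col("max_num"))
  .select("name", "tag", "num")
  // Mirrors the `select distinct` in the SQL version.
  .distinct()

res2.show(false)
```

Like the SQL version, this keeps every row that ties for the group maximum. The window approach reads the table once instead of joining it back to an aggregated copy, though which plan is faster depends on the data.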