Suppose I have the following DataFrame:
val a = Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
a.show
+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  b|  1|
|  aa|  c|  5|
|  aa|  d|  0|
|  xx|  y|  5|
|   z| zz|  9|
|   z|  b| 12|
+----+---+---+
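I want to filter this DataFrame so that:
for each group of rows (grouped by name), if any row in the group has the value 'b' in the tag column, I keep the row with the maximum num in that group; otherwise I drop the whole group.
This is the output I want: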
+----+---+---+
|name|tag|num|
+----+---+---+
|  aa|  c|  5|
|   z|  b| 12|
+----+---+---+
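Explanation: group aa contains the tag 'b', so its row with the largest num (c, 5) is kept; group xx has no 'b' tag, so it is dropped; group z contains 'b', so its row with the largest num (b, 12) is kept.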
Answer 0 (score: 1)
Try this:
val df=Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12)).toDF("name","tag","num")
df.createOrReplaceTempView("tab")
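// CTE tw: the maximum num per name, restricted to names that have at least one row tagged 'b'.
// The outer query then joins back to tab to recover the full row(s) holding that maximum.
val res = spark.sql("""
  with tw as (
    select t1.name, max(t1.num) as max_val
    from tab t1
    where t1.name in (select distinct t2.name
                      from tab t2
                      where t2.tag = 'b')
    group by t1.name
  )
  select distinct tz.name, tz.tag, tz.num
  from tab tz, tw
  where tz.name = tw.name
    and tz.num = tw.max_val
""")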
res.show(false)
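The same result can probably also be reached without SQL text by using window functions from the DataFrame API. The following is only a sketch under some assumptions: the helper column names has_b and max_num and the window name byName are made up for illustration, and the local SparkSession setup is only needed outside spark-shell (where spark already exists).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, when}

// Assumption: a local session for a standalone run; in spark-shell this just reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("max-per-group").getOrCreate()
import spark.implicits._

val df = Seq(("aa","b",1),("aa","c",5),("aa","d",0),("xx","y",5),("z","zz",9),("z","b",12))
  .toDF("name","tag","num")

// One window per name; both helper columns are computed over the whole group.
val byName = Window.partitionBy("name")

val res = df
  // has_b is 1 for every row of a group that contains at least one tag 'b', else 0
  .withColumn("has_b", max(when($"tag" === "b", 1).otherwise(0)).over(byName))
  // max_num is the largest num within the group
  .withColumn("max_num", max($"num").over(byName))
  // keep only the max row(s) of groups that contain a 'b'
  .where($"has_b" === 1 && $"num" === $"max_num")
  .select("name", "tag", "num")
  .distinct() // mirrors the "select distinct" of the SQL version

res.show(false)
```

As in the SQL version, if several rows of a group tie for the maximum num, all of them are kept (minus exact duplicates).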