How do I exclude empty strings when using the countDistinct aggregate function?
Input DataFrame:
val df = Seq(("2016", 2.1), ("", 2.1), ("2017", 1.4), (null, 1.4), (null, 0.3), (null, 0.3)).toDF("ID", "Val")
df.show(false)
+----+---+
|ID  |Val|
+----+---+
|2016|2.1|
|    |2.1|
|2017|1.4|
|null|1.4|
|null|0.3|
|null|0.3|
+----+---+
Below is the aggregation I am using. It treats the empty string as a value.
df.groupBy("Val")
.agg(countDistinct("ID") as "COUNT").show()
+---+-----+
|Val|COUNT|
+---+-----+
|1.4|    1|
|0.3|    0|
|2.1|    2|  <-- should be counted as 1
+---+-----+
How can I exclude the empty string? The expected result is:
+---+-----+
|Val|COUNT|
+---+-----+
|1.4|    1|
|0.3|    0|
|2.1|    1|
+---+-----+
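Answer 0 (score: 1)
You can apply the condition inside the agg function itself, as shown below. A when without an otherwise returns null for rows that do not match, and countDistinct ignores nulls, so empty strings are not counted.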
df.groupBy("Val").agg(countDistinct(when($"ID" =!= "", $"ID")) as "COUNT").show()
// output
+---+-----+
|Val|COUNT|
+---+-----+
|1.4|    1|
|0.3|    0|
|2.1|    1|
+---+-----+
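For anyone who wants to try this outside the Spark shell, here is a minimal self-contained sketch of the same idea. The object name, the helper name countNonEmptyDistinct, the app name, and the local master setting are my own placeholders and not part of the question; the aggregation itself is the one from this answer, written with col instead of the $ syntax.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, when}

object CountDistinctNonEmpty {
  // Hypothetical helper: counts distinct non-empty, non-null IDs per Val.
  // when(...) without otherwise(...) yields null for empty strings,
  // and countDistinct skips nulls.
  def countNonEmptyDistinct(df: DataFrame): DataFrame =
    df.groupBy("Val")
      .agg(countDistinct(when(col("ID") =!= "", col("ID"))) as "COUNT")

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session for experimentation.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-distinct-non-empty")
      .getOrCreate()
    import spark.implicits._

    // Same input data as in the question.
    val df = Seq(("2016", 2.1), ("", 2.1), ("2017", 1.4), (null, 1.4), (null, 0.3), (null, 0.3))
      .toDF("ID", "Val")

    countNonEmptyDistinct(df).show()
    spark.stop()
  }
}

Keeping the condition inside the aggregate means every Val group survives, including groups whose IDs are all null or empty, which then get a count of 0.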
Answer 1 (score: 0)
You can add a filter before grouping the data to remove all empty strings:
df.filter($"ID" =!= "")
.groupBy("Val")
.agg(countDistinct("ID") as "COUNT")
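One caveat worth checking before relying on this: the comparison null =!= "" evaluates to null, and filter treats null as false, so rows with a null ID are dropped along with the empty strings. A group whose IDs are all null or empty (Val 0.3 in the question's data) then disappears from the result instead of showing a count of 0. The sketch below illustrates this; the object name, app name, and local master setting are placeholders of mine.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

object FilterThenCount {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session for experimentation.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-then-count")
      .getOrCreate()
    import spark.implicits._

    // Same input data as in the question.
    val df = Seq(("2016", 2.1), ("", 2.1), ("2017", 1.4), (null, 1.4), (null, 0.3), (null, 0.3))
      .toDF("ID", "Val")

    // The filter removes empty strings AND null IDs before grouping.
    val counts = df.filter($"ID" =!= "")
      .groupBy("Val")
      .agg(countDistinct("ID") as "COUNT")
      .as[(Double, Long)]
      .collect()
      .toMap

    println(counts)               // only the 1.4 and 2.1 groups remain
    println(counts.contains(0.3)) // false: the all-null group is gone
    spark.stop()
  }
}

If the 0.3 row with a count of 0 is required, as in the expected result, putting the condition inside the aggregate keeps every group.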