Count distinct in a window function

Date: 2019-10-11 22:17:10

Tags: sql apache-spark apache-spark-sql

I am trying to count the distinct values of column b for each c, without doing a group by. I know it can be done with a join, but how can I compute count(distinct b) partitioned by c without resorting to a join? And why don't window functions support count distinct? Thanks in advance. Given this DataFrame:

val df= Seq(("a1","b1","c1"),
                ("a2","b2","c1"),
                ("a3","b3","c1"),
                ("a31",null,"c1"),
                ("a32",null,"c1"),
                ("a4","b4","c11"),
                ("a5","b5","c11"),
                ("a6","b6","c11"),
                ("a7","b1","c2"),
                ("a8","b1","c3"),
                ("a9","b1","c4"),
                ("a91","b1","c5"),
                ("a92","b1","c5"),
                ("a93","b1","c5"),
                ("a95","b2","c6"),
                ("a96","b2","c6"),
                ("a97","b1","c6"),
                ("a977",null,"c6"),
                ("a98",null,"c8"),
                ("a99",null,"c8"),
                ("a999",null,"c8")
                ).toDF("a","b","c");

2 Answers:

Answer 0 (score: 0)

Some databases do support count(distinct) as a window function. Otherwise, there are two alternatives. One is the sum of dense ranks:

select (dense_rank() over (partition by c order by b asc) +
        dense_rank() over (partition by c order by b desc) -
        1
       ) as count_distinct
from t;
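A minimal sketch of the dense-rank trick, run against SQLite (whose window functions follow standard semantics) on a slice of the question's data; the table name t and the sample rows are taken from the question:

```python
import sqlite3

# In-memory table mirroring a slice of the question's DataFrame.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a TEXT, b TEXT, c TEXT);
    INSERT INTO t VALUES
      ('a1','b1','c1'), ('a2','b2','c1'), ('a3','b3','c1'),
      ('a7','b1','c2'),
      ('a95','b2','c6'), ('a96','b2','c6'), ('a97','b1','c6');
""")

# count(distinct b) over (partition by c), emulated as the sum of the
# ascending and descending dense ranks minus one.
rows = conn.execute("""
    SELECT c,
           dense_rank() OVER (PARTITION BY c ORDER BY b ASC)
         + dense_rank() OVER (PARTITION BY c ORDER BY b DESC)
         - 1 AS count_distinct
    FROM t
""").fetchall()

# Every row of a partition carries the same count.
counts = {c: cd for c, cd in rows}
```

One caveat: dense_rank ranks NULLs as well, so a partition containing NULL values of b (like c1 or c6 in the full data) comes out one higher than COUNT(DISTINCT b) would report, since COUNT(DISTINCT) ignores NULLs.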

The second uses a subquery:

select sum(case when seqnum = 1 then 1 else 0 end) over (partition by c)
from (select t.*, row_number() over (partition by c, b order by b) as seqnum
      from t
     ) t;
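The same approach checked in SQLite; note that the inner row_number must be partitioned by both c and b, so that seqnum = 1 flags exactly one row per distinct (c, b) pair. The table and values mirror the question's data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a TEXT, b TEXT, c TEXT);
    INSERT INTO t VALUES
      ('a1','b1','c1'), ('a2','b2','c1'), ('a3','b3','c1'),
      ('a95','b2','c6'), ('a96','b2','c6'), ('a97','b1','c6');
""")

# seqnum = 1 marks the first row of each (c, b) pair; summing those
# flags per c yields the distinct count on every input row.
rows = conn.execute("""
    SELECT c,
           SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)
             OVER (PARTITION BY c) AS count_distinct
    FROM (SELECT t.*,
                 row_number() OVER (PARTITION BY c, b ORDER BY b) AS seqnum
          FROM t) t
""").fetchall()

counts = {c: cd for c, cd in rows}
```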

Answer 1 (score: 0)

> The count of unique column b for each c, without a group by.

The typical SQL workaround is to select the distinct tuples in a subquery, then take the window count in the outer query:

SELECT c, COUNT(*) OVER(PARTITION BY c) cnt
FROM (SELECT DISTINCT b, c FROM mytable) x
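A quick check of this workaround in SQLite, on sample rows from the question. Note that the result has one row per distinct (b, c) pair rather than per input row, so a join back to the original table would be needed to attach the count to every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (a TEXT, b TEXT, c TEXT);
    INSERT INTO mytable VALUES
      ('a1','b1','c1'), ('a2','b2','c1'), ('a3','b3','c1'),
      ('a95','b2','c6'), ('a96','b2','c6'), ('a97','b1','c6');
""")

# Deduplicate (b, c) first; a plain COUNT(*) window per c then
# counts the distinct b values.
rows = conn.execute("""
    SELECT c, COUNT(*) OVER (PARTITION BY c) AS cnt
    FROM (SELECT DISTINCT b, c FROM mytable) x
""").fetchall()

counts = {c: cnt for c, cnt in rows}
```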