How do I group by the most common string in SQL Server?

Time: 2018-02-21 01:16:46

Tags: sql sql-server

For each unique string in column 1, what is the most common string in column 2?

Example table:

1 | 2
-----
A   a
A   a
A   a
A   b
B   b
B   b
B   b
B   a
B   c
C   c
C   d
C   a

The result should look like this:

X | Most common | Weighting
A        a         0.75
B        b         0.60
C        a         0.33

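I would like to use a GROUP BY clause, but I don't know of any aggregate function that works on strings. I also realize it is somewhat ambiguous how ties should be handled (as with C). In my application, though, I only care about cases where the weighting is > 0.50, so the ambiguity doesn't matter.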

I'm using SSMS 2014.
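
To try the queries from the answers against the sample data, the table can be recreated with something like the sketch below. The table name `t` and the column names `col1` / `col2` are assumptions (they mirror the names some answers use); the question itself does not name them, and the `varchar(10)` type is likewise a guess.

```sql
-- Hypothetical setup: table name, column names and types are assumed,
-- they are not given in the question.
CREATE TABLE t (
    col1 varchar(10) NOT NULL,
    col2 varchar(10) NOT NULL
);

-- The twelve sample rows from the example table.
INSERT INTO t (col1, col2) VALUES
    ('A', 'a'), ('A', 'a'), ('A', 'a'), ('A', 'b'),
    ('B', 'b'), ('B', 'b'), ('B', 'b'), ('B', 'a'), ('B', 'c'),
    ('C', 'c'), ('C', 'd'), ('C', 'a');
```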

4 Answers:

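Answer 0 (score: 1)

The CTE below computes a weighting for every record in the table as a quotient of two counts: the number of rows sharing the same (col1, col2) pair divided by the number of rows sharing the same col1. We can then use a row number to keep only the first record of each col1 partition. Note that I don't handle ties here, although we could easily add another sort level to break them.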

WITH cte AS (
    SELECT col1, col2,
        1.0 * COUNT(*) OVER (PARTITION BY col1, col2) /
              COUNT(*) OVER (PARTITION BY col1) weighting
    FROM yourTable
)

SELECT col1, col2, weighting
FROM
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY weighting DESC) rn
    FROM cte
) t
WHERE rn = 1
ORDER BY col1;
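
As a sketch of the tie-breaking idea mentioned above, one option is to add col2 as a second sort key inside ROW_NUMBER. Choosing col2 as the tie-breaker is an assumption made for illustration; any deterministic column or expression would do.

```sql
-- Same query as above, with col2 added as a secondary sort key so that
-- ties in weighting are broken deterministically (lowest col2 wins).
WITH cte AS (
    SELECT col1, col2,
        1.0 * COUNT(*) OVER (PARTITION BY col1, col2) /
              COUNT(*) OVER (PARTITION BY col1) AS weighting
    FROM yourTable
)
SELECT col1, col2, weighting
FROM
(
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY col1
                           ORDER BY weighting DESC, col2) AS rn
    FROM cte
) t
WHERE rn = 1
ORDER BY col1;
```

With the sample data, group C has three values tied at one third each, so this version would pick a, which matches the expected result in the question.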

Answer 1 (score: 1)

All of these answers seem rather complicated:

select col1, col2, col2_cnt * 1.0 / col1_cnt
from (select col1, col2,
             count(*) as col2_cnt,
             sum(count(*)) over (partition by col1) as col1_cnt,
             row_number() over (partition by col1 order by count(*) desc) as seqnum
      from t
      group by col1, col2
     ) t
where seqnum = 1
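
Since the question only cares about weightings above 0.50, a possible simplification (a sketch, not part of the original answer) is to filter on the counts directly. A value that accounts for more than half of a group's rows is necessarily the single most common one, so no row_number is needed in that case. The `weighting` alias is added here for readability.

```sql
-- Keep only (col1, col2) pairs whose share of the col1 group exceeds 50%.
-- At most one col2 per col1 can satisfy this, so no ranking step is required.
select col1, col2, col2_cnt * 1.0 / col1_cnt as weighting
from (select col1, col2,
             count(*) as col2_cnt,
             sum(count(*)) over (partition by col1) as col1_cnt
      from t
      group by col1, col2
     ) t
where col2_cnt * 2 > col1_cnt   -- integer form of weighting > 0.50
```

On the sample data this would return rows for A and B only; C has no value above 0.50 and drops out.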

Answer 2 (score: 0)

This should be fairly simple with count; here's a quick option:

;WITH CTE
AS (
    SELECT nm1
        ,nm2
        ,count(*) AS ct
    FROM #a
    GROUP BY nm1
        ,nm2
    )
    ,CTE2
AS (
    SELECT *
        ,ROW_NUMBER() OVER (
            PARTITION BY nm1 ORDER BY ct DESC
            ) rn
        ,ct * 1.0 / (sum(ct) OVER (PARTITION BY nm1)) AS wt
    FROM CTE
    )
SELECT nm1
    ,nm2
    ,wt
FROM CTE2
WHERE rn = 1

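Note that if you have ties, row_number can be unpredictable about which row it returns. If you want both values returned when there is a tie, use rank instead.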
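
For illustration, the rank variant might look like the sketch below. It is the same query with ROW_NUMBER swapped for RANK; the alias `rk` is a name made up for this example.

```sql
;WITH CTE
AS (
    -- Count occurrences of each (nm1, nm2) pair.
    SELECT nm1
        ,nm2
        ,count(*) AS ct
    FROM #a
    GROUP BY nm1
        ,nm2
    )
    ,CTE2
AS (
    -- RANK gives tied counts the same rank, so every tied value gets rk = 1.
    SELECT *
        ,RANK() OVER (
            PARTITION BY nm1 ORDER BY ct DESC
            ) rk
        ,ct * 1.0 / (sum(ct) OVER (PARTITION BY nm1)) AS wt
    FROM CTE
    )
SELECT nm1
    ,nm2
    ,wt
FROM CTE2
WHERE rk = 1
```

With the sample data from the question, group C would come back as three rows (a, c and d), since all three values are tied.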

Answer 3 (score: 0)

One way to do this is with the count and row_number window functions.

select top 1 with ties col1,col2,weighting
from (select col1,col2,1.0*count(*) over(partition by col1,col2)/count(*) over(partition by col1) as weighting
      from t
     ) t
order by row_number() over(partition by col1 order by weighting desc,col2) --in case of ties the row with least col2 value will be picked up