For each unique string in column 1, what is the most common string in column 2?
Example table:
1 | 2
-----
A a
A a
A a
A b
B b
B b
B b
B a
B c
C c
C d
C a
The result should look like this:
X | Most common | Weighting
A a 0.75
B b 0.60
C a 0.33
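I would like to use a GROUP BY clause, but I don't know of any aggregate function that works on strings. I also realise that it is somewhat ambiguous how ties should be handled (as with C). In my application I only care about cases where the weighting is > 0.50, so that ambiguity does not matter.
I am using SSMS 2014.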
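Answer 0 (score: 1)
The CTE below computes the weighting for each record in the table as a quotient of two counts: the number of rows sharing the same (col1, col2) pair divided by the number of rows sharing the same col1. We can then use a row number to keep only the first record of each col1 partition. Note that I do not handle ties, although we could easily add another sort key to break them.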
WITH cte AS (
SELECT col1, col2,
1.0 * COUNT(*) OVER (PARTITION BY col1, col2) /
COUNT(*) OVER (PARTITION BY col1) weighting
FROM yourTable
)
SELECT col1, col2, weighting
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY weighting DESC) rn
FROM cte
) t
WHERE rn = 1
ORDER BY col1;
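The extra sort key mentioned above could look something like the following. This is only a sketch, not part of the original answer: it adds col2 as a second ORDER BY key so that ties go to the alphabetically smallest value, and it runs the query through Python's sqlite3 module (assuming SQLite 3.25 or later, which has window functions) purely so the example is self-contained. The helper name most_common_with_tiebreak is made up for illustration; on SQL Server the query text itself should work unchanged against yourTable.

```python
import sqlite3

# Sample rows from the question, as (col1, col2) pairs.
ROWS = (
    [("A", "a")] * 3 + [("A", "b")]
    + [("B", "b")] * 3 + [("B", "a"), ("B", "c")]
    + [("C", "c"), ("C", "d"), ("C", "a")]
)

# Same query as in the answer, with col2 added as a tie-breaking sort key.
QUERY = """
WITH cte AS (
    SELECT col1, col2,
           1.0 * COUNT(*) OVER (PARTITION BY col1, col2) /
                 COUNT(*) OVER (PARTITION BY col1) AS weighting
    FROM yourTable
)
SELECT col1, col2, weighting
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY col1
                              ORDER BY weighting DESC, col2) AS rn
    FROM cte
) t
WHERE rn = 1
ORDER BY col1;
"""


def most_common_with_tiebreak(rows):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE yourTable (col1 TEXT, col2 TEXT)")
    conn.executemany("INSERT INTO yourTable VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in most_common_with_tiebreak(ROWS):
        print(row)
```

With the sample data this picks a for A, b for B, and a for C. The three-way tie in C is now resolved deterministically instead of depending on whichever row the engine happens to number first.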
Answer 1 (score: 1)
All of these answers look rather complicated:
select col1, col2, col2_cnt * 1.0 / col1_cnt
from (select col1, col2,
count(*) as col2_cnt,
sum(count(*)) over (partition by col1) as col1_cnt,
row_number() over (partition by col1 order by count(*) desc) as seqnum
from t
group by col1, col2
) t
where seqnum = 1
Answer 2 (score: 0)
This should be fairly simple with count; here is a quick attempt:
;WITH CTE
AS (
SELECT nm1
,nm2
,count(*) AS ct
FROM #a
GROUP BY nm1
,nm2
)
,CTE2
AS (
SELECT *
,ROW_NUMBER() OVER (
PARTITION BY nm1 ORDER BY ct DESC
) rn
,ct * 1.0 / (sum(ct) OVER (PARTITION BY nm1)) AS wt
FROM CTE
)
SELECT nm1
,nm2
,wt
FROM CTE2
WHERE rn = 1
Note that if you have ties, the row picked by row_number may be unpredictable. If you want both values returned when there is a tie, use rank instead.
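To illustrate the rank variant, here is a rough sketch rather than a tested SQL Server script. It swaps ROW_NUMBER for RANK in the query above and runs it through Python's sqlite3 module (assuming SQLite 3.25 or later) so it can be tried without a server. The table name a stands in for the temp table #a, and the CTE names counts and ranked and the helper most_common_all_ties are my own choices for illustration.

```python
import sqlite3

# Sample rows from the question, as (nm1, nm2) pairs.
ROWS = (
    [("A", "a")] * 3 + [("A", "b")]
    + [("B", "b")] * 3 + [("B", "a"), ("B", "c")]
    + [("C", "c"), ("C", "d"), ("C", "a")]
)

# RANK gives tied rows the same rank, so every tied winner passes rk = 1.
QUERY = """
WITH counts AS (
    SELECT nm1, nm2, COUNT(*) AS ct
    FROM a
    GROUP BY nm1, nm2
),
ranked AS (
    SELECT nm1, nm2,
           RANK() OVER (PARTITION BY nm1 ORDER BY ct DESC) AS rk,
           ct * 1.0 / SUM(ct) OVER (PARTITION BY nm1) AS wt
    FROM counts
)
SELECT nm1, nm2, wt
FROM ranked
WHERE rk = 1
ORDER BY nm1, nm2;
"""


def most_common_all_ties(rows):
    """Hypothetical helper: load rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE a (nm1 TEXT, nm2 TEXT)")
    conn.executemany("INSERT INTO a VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in most_common_all_ties(ROWS):
        print(row)
```

On the sample data, A and B still return one row each, while C returns all three of a, c and d, each with a weighting of about 0.33. If only weightings above 0.50 matter, as in the question, a tie for first place can never qualify, so either function gives the same filtered result.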
Answer 3 (score: 0)
One way to do this is with the count and row_number window functions.
select top 1 with ties col1,col2,weighting
from (select col1,col2,1.0*count(*) over(partition by col1,col2)/count(*) over(partition by col1) as weighting
from t
) t
order by row_number() over(partition by col1 order by weighting desc,col2) --in case of ties the row with least col2 value will be picked up