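I need to identify the rows that cannot reach a count greater than or equal to a threshold, even when grouped at the highest level. If a row meets the threshold at a lower (more detailed) grouping level, it is excluded from the checks at the higher levels.
For example:
I have the following values, with a threshold of 5.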
COL_1 COL_2 COL_3
CH ZZZZZZ T77613
CH ZZZZZZ R537973
**CH 181600 19M8323**
**CH HYC440 RE575008**
**CH 211000 AE74215**
CH ZZZZZZ T77858
CH ZZZZZZ T76938
CH ZZZZZZ T77932
CH ZZZZZZ T76938
CH ZZZZZZ 14M7396
CH ZZZZZZ RE593267
CH ZZZZZZ RE593267
CH ZZZZZZ RE579130
CH ZZZZZZ 14M7296
CH ZZZZZZ RE580337
CH ZZZZZZ RE580337
Only the bold rows should be selected.
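To make the rule concrete, here is a minimal Python sketch of the cascade described above. It is illustrative only and not part of the SQL solution; the helper name `rows_below_threshold` is made up for this example, and the rows are copied from the sample data.

```python
from collections import Counter

# Sample rows from the question as (col_1, col_2, col_3) tuples.
ROWS = [
    ("CH", "ZZZZZZ", "T77613"),
    ("CH", "ZZZZZZ", "R537973"),
    ("CH", "181600", "19M8323"),
    ("CH", "HYC440", "RE575008"),
    ("CH", "211000", "AE74215"),
    ("CH", "ZZZZZZ", "T77858"),
    ("CH", "ZZZZZZ", "T76938"),
    ("CH", "ZZZZZZ", "T77932"),
    ("CH", "ZZZZZZ", "T76938"),
    ("CH", "ZZZZZZ", "14M7396"),
    ("CH", "ZZZZZZ", "RE593267"),
    ("CH", "ZZZZZZ", "RE593267"),
    ("CH", "ZZZZZZ", "RE579130"),
    ("CH", "ZZZZZZ", "14M7296"),
    ("CH", "ZZZZZZ", "RE580337"),
    ("CH", "ZZZZZZ", "RE580337"),
]


def rows_below_threshold(rows, threshold=5):
    """Hypothetical helper: keep only rows that never reach the threshold.

    The check runs from the most detailed grouping (col_1, col_2, col_3)
    up to col_1 alone. At each level, rows whose group has at least
    `threshold` members are dropped before the next level is counted.
    """
    remaining = list(rows)
    for width in (3, 2, 1):
        counts = Counter(row[:width] for row in remaining)
        remaining = [row for row in remaining if counts[row[:width]] < threshold]
    return remaining


result = rows_below_threshold(ROWS)
for row in result:
    print(*row)
```

With the sample data, no (col_1, col_2, col_3) group reaches 5, the 13 `CH ZZZZZZ` rows are dropped at the (col_1, col_2) level, and the 3 remaining `CH` rows stay below the threshold at the col_1 level.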
I am using the following query:
WITH Step1 AS (
SELECT x1.*
FROM mytable AS x1
LEFT JOIN (
SELECT col_1
,col_2
,col_3
FROM mytable
GROUP BY col_1
,col_2
,col_3
HAVING COUNT(*) >= 5
) y1 ON x1.col_1 = y1.col_1
AND x1.col_2 = y1.col_2
AND x1.col_3 = y1.col_3
WHERE y1.col_1 IS NULL
AND y1.col_2 IS NULL
AND y1.col_3 IS NULL
)
,Step2 AS (
SELECT x1.*
FROM Step1 x1
LEFT JOIN (
SELECT col_1
,col_2
FROM Step1
GROUP BY col_1
,col_2
HAVING COUNT(*) >= 5
) y1 ON x1.col_1 = y1.col_1
AND x1.col_2 = y1.col_2
WHERE y1.col_1 IS NULL
AND y1.col_2 IS NULL
)
,Step3 AS (
SELECT x1.*
FROM Step2 x1
LEFT JOIN (
SELECT col_1
FROM Step2
GROUP BY col_1
HAVING COUNT(*) >= 5
) y1 ON x1.col_1 = y1.col_1
WHERE y1.col_1 IS NULL
)
SELECT *
FROM Step3
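This query gives the correct result. However, once the table has more than about 17,000 rows, the SQL query hangs and times out.
Does anyone know what is going wrong, and can you suggest a better solution?
Update:
I found some answers at https://www.sqlshack.com/why-is-my-cte-so-slow/. After storing the results of the first two CTEs in a temporary table, I was able to run the query and get the result in 45 seconds.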
WITH Step1 AS (
SELECT x1.*
FROM mytable AS x1
LEFT JOIN (
SELECT col_1
,col_2
,col_3
FROM mytable
GROUP BY col_1
,col_2
,col_3
HAVING COUNT(*) >= 5
) y1 ON x1.col_1 = y1.col_1
AND x1.col_2 = y1.col_2
AND x1.col_3 = y1.col_3
WHERE y1.col_1 IS NULL
AND y1.col_2 IS NULL
AND y1.col_3 IS NULL
)
,Step2 AS (
SELECT x1.*
FROM Step1 x1
LEFT JOIN (
SELECT col_1
,col_2
FROM Step1
GROUP BY col_1
,col_2
HAVING COUNT(*) >= 5
) y1 ON x1.col_1 = y1.col_1
AND x1.col_2 = y1.col_2
WHERE y1.col_1 IS NULL
AND y1.col_2 IS NULL
)
SELECT * INTO #CTE2 FROM Step2;
WITH Step3 AS (
SELECT x1.*
FROM #CTE2 x1
LEFT JOIN (
SELECT col_1
FROM #CTE2 -- the Step2 CTE is out of scope after the SELECT ... INTO statement
GROUP BY col_1
HAVING COUNT(*) >= 5
) y1 ON x1.col_1 = y1.col_1
WHERE y1.col_1 IS NULL
)
SELECT *
FROM Step3 ;
But this does mean it is no longer a single SQL query.
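Answer 0: (score: 0)
Your requirements are not clear at all, but since you say your query gives the correct result and your actual problem is only performance, I would start by replacing those LEFT JOINs with EXISTS and HAVING, so that you select the rows you actually want to return instead of discarding the ones you don't...
The next step is to check that the table is properly indexed.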
;WITH
Step1 AS (
SELECT *
FROM MyTable S1
WHERE EXISTS (
SELECT 1
FROM MyTable
WHERE COL_1 = S1.COL_1 AND COL_2 = S1.COL_2 AND COL_3 = S1.COL_3
GROUP BY COL_1, COL_2, COL_3
HAVING COUNT(*) < 5 )
) ,
Step2 AS
(
SELECT *
FROM Step1 S1
WHERE EXISTS (
SELECT 1
FROM Step1
WHERE COL_1 = S1.COL_1 AND COL_2 = S1.COL_2
GROUP BY COL_1,COL_2
HAVING COUNT(*) < 5 )
) ,
Step3 AS
(
SELECT *
FROM Step2 S2
WHERE EXISTS (
SELECT 1
FROM Step2
WHERE COL_1 = S2.COL_1
GROUP BY COL_1
HAVING COUNT(*) < 5 )
)
SELECT * FROM Step3
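One further option, offered as a sketch rather than a measured fix: compute the group sizes with `COUNT(*) OVER (PARTITION BY ...)`, so that each CTE is referenced exactly once and no self-join is needed. The example below runs the SQL through Python's `sqlite3` module against the sample data from the question, purely so it can be executed anywhere; it assumes an SQLite build with window-function support (3.25 or later). The CTE names `s3`, `s2`, `s1` and the `:threshold` parameter are made up for this example. Whether this is faster on your 17,000-row table has not been tested here.

```python
import sqlite3

# Sample rows from the question as (col_1, col_2, col_3) tuples.
ROWS = [
    ("CH", "ZZZZZZ", "T77613"),
    ("CH", "ZZZZZZ", "R537973"),
    ("CH", "181600", "19M8323"),
    ("CH", "HYC440", "RE575008"),
    ("CH", "211000", "AE74215"),
    ("CH", "ZZZZZZ", "T77858"),
    ("CH", "ZZZZZZ", "T76938"),
    ("CH", "ZZZZZZ", "T77932"),
    ("CH", "ZZZZZZ", "T76938"),
    ("CH", "ZZZZZZ", "14M7396"),
    ("CH", "ZZZZZZ", "RE593267"),
    ("CH", "ZZZZZZ", "RE593267"),
    ("CH", "ZZZZZZ", "RE579130"),
    ("CH", "ZZZZZZ", "14M7296"),
    ("CH", "ZZZZZZ", "RE580337"),
    ("CH", "ZZZZZZ", "RE580337"),
]

# Each level filters on the count from the previous level (WHERE runs
# before the window function), then counts the surviving rows at the
# next, coarser grouping.
QUERY = """
WITH s3 AS (
    SELECT col_1, col_2, col_3,
           COUNT(*) OVER (PARTITION BY col_1, col_2, col_3) AS cnt
    FROM mytable
),
s2 AS (
    SELECT col_1, col_2, col_3,
           COUNT(*) OVER (PARTITION BY col_1, col_2) AS cnt
    FROM s3
    WHERE cnt < :threshold
),
s1 AS (
    SELECT col_1, col_2, col_3,
           COUNT(*) OVER (PARTITION BY col_1) AS cnt
    FROM s2
    WHERE cnt < :threshold
)
SELECT col_1, col_2, col_3
FROM s1
WHERE cnt < :threshold
ORDER BY col_2
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (col_1 TEXT, col_2 TEXT, col_3 TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", ROWS)

result = conn.execute(QUERY, {"threshold": 5}).fetchall()
for row in result:
    print(*row)
conn.close()
```

The SQL uses only CTEs and `COUNT(*) OVER (PARTITION BY ...)`, which SQL Server also supports, so the statement should carry over with a T-SQL variable such as `@threshold` in place of `:threshold`. The `ORDER BY` is there only to make the demo output deterministic.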