Question

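I have two versions of the same dataset and need to do a full join to find the records that are missing from either one; both versions are missing some records. I came up with two approaches, but both have drawbacks. My dataset and my filter conditions are both very large.

Solution 1 has the drawback of using CTEs, which splits the filters apart and makes the code harder to read. I would like to have just one query: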

create table #temp (id int, vers nvarchar(1))
insert into #temp select 1,'a' union select 2,'a' union select 3,'a'
              union select 1,'b' union select 2,'b' union select 100,'b'

;WITH vers_a as (SELECT * FROM #temp WHERE vers = 'a')
,vers_b as (SELECT * FROM #temp WHERE vers = 'b')

SELECT ta.id, tb.id, ta.vers, tb.vers
FROM vers_a ta
FULL JOIN vers_b tb on ta.id = tb.id
WHERE ta.id is null or tb.id is null

drop table #temp

Solution 2 duplicates the filters and has a bigger execution plan:

create table #temp (id int, vers nvarchar(1))
insert into #temp select 1,'a' union select 2,'a' union select 3,'a'
              union select 1,'b' union select 2,'b' union select 100,'b'

SELECT ta.id, tb.id, ta.vers, tb.vers
FROM #temp ta
FULL JOIN #temp tb on ta.id = tb.id and ta.vers = 'a' and tb.vers = 'b'
WHERE (ta.id is null or tb.id is null) and (ta.vers = 'a' or tb.vers = 'b')

drop table #temp

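So my question is: is it possible to have something like solution 2, but without defining the conditions twice and with a smaller execution plan, as in solution 1?

Edit: When running both solutions in one batch, I can see that solution 2 takes 26% and solution 1 takes 45%, even though its execution plan is smaller. I want the faster solution (not necessarily the one with the smaller execution plan, as I said in the question) and, if possible, no code duplication.

Edit2: Sorry for the misleading first edit, I'm not good at optimization :) I tested this on a set of ~1.5 million rows and solution 1 was faster. The set was generated with this: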

create table #temp (id int, vers nvarchar(1))
insert into #temp select 1,'a' union select 2,'a' union select 3,'a'
              union select 1,'b' union select 2,'b' union select 100,'b'
while (select count(*) from #temp) < 1000000
begin
    insert into #temp select id+ABS(CHECKSUM(NewId()))%10000, vers from #temp
end

Answer 1

This should have a good plan. An index on vers may help.

SELECT ta.id, tb.id, ta.vers, tb.vers
FROM (SELECT * FROM #temp WHERE vers = 'a') ta
FULL JOIN (SELECT * FROM #temp WHERE vers = 'b') tb on ta.id = tb.id 
WHERE (ta.id is null or tb.id is null)

Edit: I did some testing. The query above uses less CPU time than the 2 other versions.

-- SETUP
drop table temp;
go
create table temp (
    id int
    ,vers nvarchar(1));

insert temp(id,vers)
select top(100000)
    row_number() over(order by (select null)) / 2
    ,case ABS(CHECKSUM(NewId())) % 2 when 0 then 'a' else 'b' end
from sys.all_objects t, sys.all_objects t1;

create index idx_temp_vers on temp(vers) include(id) with fillfactor=90;

select top(50) * from temp;

-- TEST RUNS
SET STATISTICS TIME ON;

print ' 1 index query 1 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM (SELECT * FROM temp WHERE vers = 'a') ta
FULL JOIN (SELECT * FROM temp WHERE vers = 'b') tb on ta.id = tb.id
WHERE (ta.id is null or tb.id is null);

print ' 1 index query 2 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM temp ta
FULL JOIN temp tb on ta.id = tb.id and ta.vers = 'a' and tb.vers = 'b'
WHERE (ta.id is null or tb.id is null) and (ta.vers = 'a' or tb.vers = 'b');

print ' 1 index query 3 '
SELECT ta.id, ta.vers from temp ta
where ta.vers = 'a'
  and ta.id NOT IN(SELECT tb.id FROM temp tb WHERE tb.vers = 'b')
UNION ALL
SELECT tb.id, tb.vers from temp tb
where tb.vers = 'b'
  and tb.id NOT IN(SELECT ta.id FROM temp ta WHERE ta.vers = 'a');

-- One more index
create index idx_temp_id on temp(id) with fillfactor=90;

print ' 2 indexes query 1 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM (SELECT * FROM temp WHERE vers = 'a') ta
FULL JOIN (SELECT * FROM temp WHERE vers = 'b') tb on ta.id = tb.id
WHERE (ta.id is null or tb.id is null);

print ' 2 indexes query 2 '
SELECT ta.id, tb.id, ta.vers, tb.vers
FROM temp ta
FULL JOIN temp tb on ta.id = tb.id and ta.vers = 'a' and tb.vers = 'b'
WHERE (ta.id is null or tb.id is null) and (ta.vers = 'a' or tb.vers = 'b');

print ' 2 indexes query 3 '
SELECT ta.id, ta.vers from temp ta
where ta.vers = 'a'
  and ta.id NOT IN(SELECT tb.id FROM temp tb WHERE tb.vers = 'b')
UNION ALL
SELECT tb.id, tb.vers from temp tb
where tb.vers = 'b'
  and tb.id NOT IN(SELECT ta.id FROM temp ta WHERE ta.vers = 'a');

SET STATISTICS TIME OFF;

Results

 1 index query 1
(49898 row(s) affected)
 SQL Server Execution Times:
   CPU time = 156 ms,  elapsed time = 3825 ms.

 1 index query 2
(49898 row(s) affected)
 SQL Server Execution Times:
   CPU time = 281 ms,  elapsed time = 2962 ms.

 1 index query 3
(49898 row(s) affected)
 SQL Server Execution Times:
   CPU time = 422 ms,  elapsed time = 2508 ms.

 2 indexes query 1
(49898 row(s) affected)
 SQL Server Execution Times:
   CPU time = 172 ms,  elapsed time = 2679 ms.

 2 indexes query 2
(49898 row(s) affected)
 SQL Server Execution Times:
   CPU time = 406 ms,  elapsed time = 3468 ms.

 2 indexes query 3
(49898 row(s) affected)
 SQL Server Execution Times:
   CPU time = 407 ms,  elapsed time = 3728 ms.

Answer 2

How about doing it this way? It also avoids returning columns with null values:

SELECT ta.id, ta.vers from #temp ta
                    where ta.vers = 'a'
                            and ta.id NOT IN(SELECT tb.id FROM #temp tb WHERE tb.vers = 'b')
UNION ALL
SELECT tb.id, tb.vers from #temp tb
                    where tb.vers = 'b'
                            and tb.id NOT IN(SELECT ta.id FROM #temp ta WHERE ta.vers = 'a')
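
One caveat with the NOT IN form: id is declared as a nullable int in the question's table, and if either subquery ever returns a NULL id, NOT IN evaluates to unknown and that branch returns no rows at all. A possible variant, offered only as a sketch, is to write the same anti-join with NOT EXISTS. The row with the NULL id below is a hypothetical addition to the question's sample data, included purely to show the difference; whether real data can contain NULL ids is an assumption.

```sql
-- Sample data from the question, plus one hypothetical row with a NULL id
-- (added here as an assumption to illustrate the behaviour).
create table #temp (id int, vers nvarchar(1));
insert into #temp (id, vers)
values (1,'a'), (2,'a'), (3,'a'),
       (1,'b'), (2,'b'), (100,'b'),
       (null,'b');

-- Rows of version 'a' with no matching id in version 'b', and vice versa.
SELECT ta.id, ta.vers
FROM #temp ta
WHERE ta.vers = 'a'
  AND NOT EXISTS (SELECT 1 FROM #temp tb
                  WHERE tb.vers = 'b' AND tb.id = ta.id)
UNION ALL
SELECT tb.id, tb.vers
FROM #temp tb
WHERE tb.vers = 'b'
  AND NOT EXISTS (SELECT 1 FROM #temp ta
                  WHERE ta.vers = 'a' AND ta.id = tb.id);
-- Returns (3,'a'), (100,'b') and (NULL,'b'), in no guaranteed order.
-- With the same data, the NOT IN version returns only (100,'b').

drop table #temp;
```

Like the NOT IN version, this still repeats the vers filters, so it does not remove the duplication the question asks about. It only changes how NULL ids are handled.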

SQL full join with conditions on the tables
