My SQL query looks like this:
INSERT INTO staging.lps_data
(
col1
,col2
,col3
,col4
,col5
)
SELECT DISTINCT
col1
,col2
,col3
,col4
,col5
FROM tbl1 r WITH ( NOLOCK )
INNER JOIN tbl2 p WITH ( NOLOCK ) ON p.col1= r.col1
INNER JOIN tbl3 l WITH ( NOLOCK ) ON l.col2 = r.col2
WHERE r.col1 NOT IN ( 'Foreclosure Deed',
'Foreclosure Deed - Judicial',
'Foreclosure RESPA',
'Foreclosure Vendor Assignment Review',
'Foreclosure Stop',
'Foreclosure Screenprints Other',
'Foreclosure Sale Audit',
'Foreclosure Property Preservation',
'Foreclosure Acquisition',
'Foreclosure Notices Attorney Certification' )
AND ( r.col1 LIKE 'foreclosure%'
OR r.col1 = 'Vesting CT'
);
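tbl1 contains 100 million records, tbl2 contains 100 million records, and tbl3 contains 10 million records. According to the estimated execution plan, most of the cost is in the DISTINCT operation. Note: I have already applied appropriate indexes to the tables.
I tried to solve this with batch processing, as shown below: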
INSERT INTO TEMP1
SELECT SK_ID from tbl1 r where ( r.processname LIKE 'foreclosure%' OR r.processname = 'Vesting CT')
EXCEPT
-- EXCEPT subtracts this set, so it has to select the excluded names (IN, not NOT IN)
SELECT SK_ID from tbl1 r where r.processname IN ( 'Foreclosure Deed','Foreclosure Deed - Judicial',
'Foreclosure RESPA',
'Foreclosure Vendor Assignment Review',
'Foreclosure Stop',
'Foreclosure Screenprints Other',
'Foreclosure Sale Audit',
'Foreclosure Property Preservation',
'Foreclosure Acquisition',
'Foreclosure Notices Attorney Certification' )
-- Load data into staging table in batch mode
DECLARE @STARTID BIGINT=1, @LASTID BIGINT, @ENDID BIGINT;
DECLARE @SPLITCONFIG BIGINT =1000 -- Process 1000 records as batch
SELECT @LASTID = MAX(ID) + 1 FROM TEMP1 -- exclusive upper bound, so the row with the highest ID is loaded too
WHILE @STARTID < @LASTID
BEGIN
IF(@STARTID + @SPLITCONFIG > @LASTID)
SET @ENDID = @LASTID
ELSE
SET @ENDID = @STARTID + @SPLITCONFIG
INSERT INTO staging.lps_data
( col1
,col2
,col3
,col4
,col5)
SELECT DISTINCT
col1
,col2
,col3
,col4
,col5
FROM tbl1 r WITH (NOLOCK)
INNER JOIN TEMP1 SK WITH(NOLOCK) ON (r.SK_ID=SK.SK_ID AND SK.ID >=@STARTID AND SK.ID < @ENDID)
INNER JOIN tbl2 p WITH (NOLOCK) ON p.refinfoidentifier = r.refinfoidentifier
INNER JOIN tbl3 l WITH (NOLOCK) ON l.loaninfoidentifier = r.loaninfoidentifier
SET @STARTID = @ENDID
END
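With the first approach my server crashed; with the second approach I can process the full set of records in 4 hours.
Please suggest anything else I could do to complete this process in under an hour.
Answer 0 (score: 0)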
I'm not sure which indexes your tables have, but try changing the SELECT to use ROW_NUMBER() instead of DISTINCT:
SELECT
col1
,col2
,col3
,col4
,col5 FROM
(
SELECT
col1
,col2
,col3
,col4
,col5
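,ROW_NUMBER() OVER(PARTITION BY col1, col2, col3, col4, col5 ORDER BY r.col1) as rn -- partition by the output columns so rn = 1 keeps one row per distinct combination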
FROM tbl1 r
INNER JOIN tbl2 p ON p.col1= r.col1
INNER JOIN tbl3 l ON l.col2 = r.col2
WHERE
r.col1 NOT IN ( 'Foreclosure Deed',
'Foreclosure Deed - Judicial',
'Foreclosure RESPA',
'Foreclosure Vendor Assignment Review',
'Foreclosure Stop',
'Foreclosure Screenprints Other',
'Foreclosure Sale Audit',
'Foreclosure Property Preservation',
'Foreclosure Acquisition',
'Foreclosure Notices Attorney Certification' )
AND ( r.col1 LIKE 'foreclosure%' OR r.col1 = 'Vesting CT')) xxx
WHERE rn = 1;
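Answer 1 (score: 0)
Try inserting the strings from the NOT IN list into a NEW_TABLE and LEFT JOIN it to tbl1, keeping only the rows that have no match (filter on the NEW_TABLE column IS NULL). Preferably match on an ID or integer rather than a string. Alternatively, use NOT EXISTS (SELECT 1 FROM NEW_TABLE WHERE ...).
Bye,
Igor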
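A minimal sketch of the NOT EXISTS variant, assuming a temporary table called #excluded_process with a process_name column (both names are hypothetical, not from the question) and the question's placeholder column r.col1. The excluded names are loaded once, and the filter becomes an anti-join:

```sql
-- Hypothetical helper table holding the process names to exclude.
CREATE TABLE #excluded_process
(
    process_name VARCHAR(100) NOT NULL PRIMARY KEY
);

INSERT INTO #excluded_process (process_name)
VALUES ('Foreclosure Deed'),
       ('Foreclosure Deed - Judicial'),
       ('Foreclosure RESPA'),
       ('Foreclosure Vendor Assignment Review'),
       ('Foreclosure Stop'),
       ('Foreclosure Screenprints Other'),
       ('Foreclosure Sale Audit'),
       ('Foreclosure Property Preservation'),
       ('Foreclosure Acquisition'),
       ('Foreclosure Notices Attorney Certification');

-- Keep the rows that match the wanted pattern and have no entry
-- in the exclusion table.
SELECT r.col1
FROM tbl1 r
WHERE ( r.col1 LIKE 'foreclosure%' OR r.col1 = 'Vesting CT' )
  AND NOT EXISTS ( SELECT 1
                   FROM #excluded_process e
                   WHERE e.process_name = r.col1 );
```

Whether this beats the literal NOT IN list depends on the actual plan, so it is worth comparing both against the real data.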
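Answer 2 (score: 0)
So what is the DISTINCT for when there cannot be any duplicates? Remove it, and your query should be faster.
As you said, those columns are unique in the tables, so they are certainly indexed. Most likely the tables themselves will not be read at all, only the indexes, because they already contain all the required data. That is as good as it gets. I don't think any further optimization is possible here.
(Of course, with large tables and indexes one can always consider partitioning to get at the data faster.)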
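For illustration, an index that lets the filter on tbl1 be answered from the index alone could look like the sketch below. The index name and the choice of key and included columns are assumptions, since the question only shows placeholder column names:

```sql
-- Hypothetical covering index: col1 is the filter/join key, and col2
-- is included so the query does not have to touch the base table.
CREATE NONCLUSTERED INDEX IX_tbl1_col1_covering
    ON tbl1 (col1)
    INCLUDE (col2);
```

If the existing indexes already cover the query in this way, adding another one would only increase write cost.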