I recently asked a question about how to solve a problem in a T-SQL query, which led me to use a MERGE statement. That has turned out to be problematic, however, because its performance is terrible.
What I need to do is insert rows based on a result set and save the IDs of the inserted rows together with the data they were generated from (see the related question).
I ended up with a query like this:
DECLARE @temp AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100)
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
;WITH person AS
(
SELECT top 1
t.[Personnumber]
,t.[Firstname]
,t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN
(
SELECT i.Account FROM [security].[Accounts] i
)
)
MERGE [security].[Identities] AS tar
USING person AS src
ON 0 = 1 -- all rows from src need to be inserted; I've already filtered them out in the CTE query.
WHEN NOT MATCHED THEN
INSERT
(
[Created], [Updated]
)
VALUES
(
GETUTCDATE(), GETUTCDATE()
)
OUTPUT $action, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname] INTO @temp;
SELECT * FROM @temp
With this query I insert all the rows and then save them, together with the source values, to a temp table for later processing.
This works well on fewer than 10k rows, but the data set I'm targeting is close to 2 million rows. I ran the query for about an hour without it completing (on a scaled-up premium-tier Azure database).
Question: How can I make this faster? Can I achieve the same result without MERGE?
Answer 0 (Score: 2)
It looks to me like your Identities table is being used purely as a sequence generator, since apart from the timestamps you don't insert anything into it. Have you considered using a SEQUENCE instead of a table to generate the keys? With a sequence you could eliminate this step entirely, because you can generate a key whenever you need one.
Outputting millions of rows into a table variable is also unlikely to work well; table variables are generally only suitable for up to a few thousand rows.
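The INSERT below pulls its keys from a sequence called AccountSeq, whose definition isn't shown here. A minimal sketch of what it might look like (the BIGINT type and start value are assumptions; align them with GlobalId):
CREATE SEQUENCE AccountSeq -- hypothetical definition; match the type to GlobalId
    AS BIGINT
    START WITH 1
    INCREMENT BY 1;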
INSERT INTO security.Accounts (GlobalId, Account, Firstname, Lastname)
SELECT NEXT VALUE FOR AccountSeq, r.Personnumber, r.Firstname, r.Lastname
FROM temp.RawRoles AS r
LEFT JOIN security.Accounts AS a ON r.Personnumber = a.Account
WHERE a.Account IS NULL; -- anti-join: only Personnumbers not already in Accounts
INSERT INTO security.identities (GlobalId, Created, Updated)
SELECT a.GlobalId, GETUTCDATE() AS Created, GETUTCDATE() AS Updated
FROM security.Accounts AS a
LEFT JOIN security.identities AS i ON a.GlobalId = i.GlobalId
WHERE i.GlobalId IS NULL;
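Because each GlobalId is drawn from the sequence at insert time, the Personnumber-to-GlobalId mapping lands directly in security.Accounts, so there is no longer any need to capture OUTPUT rows into a table variable at all.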
Answer 1 (Score: 0)
At first glance, MERGE doesn't look like the culprit behind the poor performance. The merge condition is always false (0 = 1), so the insert (into [security].[Identities]) is the only possible code path.
Bypassing [security].[Identities] and MERGE entirely, how long does it take to insert the 2 million rows into @temp?
DECLARE @temp AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100)
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
--is this fast?!?
INSERT INTO @temp(action, GlobalId, Personnumber, Firstname, LastName)
SELECT 'insert', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN
(
SELECT i.Account FROM [security].[Accounts] i
);
Things to check:
What is the data type of [temp].[RawRoles].Personnumber? Is Personnumber really NVARCHAR(100)? Do you need to store non-Latin characters in a person number?
NCHAR takes twice the storage of CHAR, so if the values are alphanumeric (common Latin characters) or digits with leading zeros, VARCHAR/CHAR is likely the better choice; and if a numeric data type can satisfy the requirement, INT/BIGINT/DECIMAL is preferable.
Is there an index on [temp].[RawRoles].Personnumber? Without one, the existence check has to sort or hash [temp].[RawRoles].Personnumber, at an extra cost in resource throughput/DTUs. Since most of temp.RawRoles will end up being processed and inserted, a clustered index on [temp].RawRoles is probably the most beneficial.
What is the data type of [security].[Accounts].Account, and is there an index on that column? [security].[Accounts].Account and [temp].[RawRoles].Personnumber should be the same data type, and ideally both should be indexed. If [security].[Accounts] is the final destination of the processed [temp].[RawRoles] rows, the table will hold millions of rows, and any future processing will need an index on Account. The downside of an index is slower inserts, so if these 2 million rows are the initial bulk load, it may be better to have no index on Account while the bulk is inserted into security.Accounts (and to create it afterwards, as in the summary below).
To summarize:
--contemplate & decide whether a change of the Account data type is needed (a data type change can have many implications for applications using the db)
--change the data type of Personnumber to the datatype of Account(security.Accounts)
ALTER TABLE temp.RawRoles ALTER COLUMN Personnumber "datatype of security.Accounts.Account" NOT NULL; -- rows having undefined Personnumber?
--clustered index Personnumber
CREATE /*UNIQUE*/ CLUSTERED INDEX uclxPersonnumber ON temp.RawRoles(Personnumber); --unique preferred, if possible
--index on account (not needed[?] when security.Accounts is empty)
CREATE INDEX idxAccount ON [security].Accounts(Account);
--baseline, how fast can we do a straight forward insertion of 2 million rows?
DECLARE @tempbaseline AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100) --ignore this for now
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
INSERT INTO @tempbaseline([action], GlobalId, Personnumber, Firstname, LastName)
SELECT 'INSERT', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
--if the execution time (baseline) is acceptable, proceed with the merge code
--"merge with output into" should be be "slightly"/s slower than the baseline.
--if the baseline is not acceptable (simple insertion takes too much time) then merge is futile
/*
DECLARE @temp....
MERGE [security].[Identities] AS tar
USING
(
SELECT --top 1
t.[Personnumber]
,t.[Firstname]
,t.[Lastname]
FROM [temp].[RawRoles] t
WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
) AS src
ON 0 = 1 -- all rows from src need to be inserted; I've already filtered them out in the USING query.
WHEN NOT MATCHED THEN
INSERT
(
[Created], [Updated]
)
VALUES
(
GETUTCDATE(), GETUTCDATE()
)
OUTPUT 'INSERT' /** only insert is possible $action */, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname] INTO @temp;
--delete the index on Account (the process will insert 2mil)
DROP INDEX idxAccount ON [security].Accounts --review and create this index after the bulk of accounts is inserted.
...your process
*/