I recently asked a question about how to solve a problem in a T-SQL query, which led me to use a MERGE statement. That has turned out to be problematic, however, because its performance is terrible.
What I need to do is insert rows based on a result set and save the IDs of the inserted rows together with the data they were generated from (see the related question).
I ended up with a query like this:
DECLARE @temp AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100)
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
;WITH person AS
(
SELECT top 1
t.[Personnumber]
,t.[Firstname]
,t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN
(
SELECT i.Account FROM [security].[Accounts] i
)
)
MERGE [security].[Identities] AS tar
USING person AS src
ON 0 = 1 -- all rows from src need to be inserted; I've already filtered them out in the CTE query.
WHEN NOT MATCHED THEN
INSERT
(
[Created], [Updated]
)
VALUES
(
GETUTCDATE(), GETUTCDATE()
)
OUTPUT $action, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname] INTO @temp;
SELECT * FROM @temp
With this query I insert all the rows and then save them, together with the source values, to a temp table for later processing.
This works well on fewer than 10k rows, but the data set I'm targeting is close to 2 million rows. I ran the query for about an hour without it completing (on a scaled-up premium-tier Azure database).
Question: How can I make this faster? Can I achieve the same result without MERGE?
Answer 0 (Score: 2)
It looks to me like your Identities table is being used purely as a sequence generator, since apart from the timestamps you don't insert anything into it. Have you considered using a SEQUENCE instead of a table to generate the keys? With a sequence you could eliminate this step entirely, because you can generate a key whenever you need one.
Outputting millions of rows into a table variable is also unlikely to work well; table variables are generally only suitable for up to a few thousand rows.
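The INSERT below pulls its keys from a sequence called AccountSeq, whose definition isn't shown here. A minimal sketch of what it might look like (the BIGINT type and start value are assumptions; align them with GlobalId):
CREATE SEQUENCE AccountSeq -- hypothetical definition; match the type to GlobalId
    AS BIGINT
    START WITH 1
    INCREMENT BY 1;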
INSERT INTO security.Accounts (GlobalId, Account, Firstname, Lastname)
SELECT NEXT VALUE FOR AccountSeq, r.Personnumber, r.Firstname, r.Lastname
FROM temp.RawRoles AS r
LEFT JOIN security.Accounts AS a ON r.Personnumber = a.Account
WHERE a.Account IS NULL; -- anti-join: only Personnumbers not already in Accounts
INSERT INTO security.identities (GlobalId, Created, Updated)
SELECT a.GlobalId, GETUTCDATE() AS Created, GETUTCDATE() AS Updated
FROM security.Accounts AS a
LEFT JOIN security.identities AS i ON a.GlobalId = i.GlobalId
WHERE i.GlobalId IS NULL;
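Because each GlobalId is drawn from the sequence at insert time, the Personnumber-to-GlobalId mapping lands directly in security.Accounts, so there is no longer any need to capture OUTPUT rows into a table variable at all.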
Answer 1 (Score: 0)
At first glance, MERGE doesn't look like the culprit behind the poor performance. The merge condition is always false (0 = 1), so the insert (into [security].[Identities]) is the only possible code path.
Bypassing [security].[Identities] and MERGE entirely, how long does it take to insert the 2 million rows into @temp?
DECLARE @temp AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100)
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
--is this fast?!?
INSERT INTO @temp(action, GlobalId, Personnumber, Firstname, LastName)
SELECT 'insert', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE t.Personnumber NOT IN
(
SELECT i.Account FROM [security].[Accounts] i
);
Things to check:
What is the data type of [temp].[RawRoles].Personnumber? Is Personnumber really NVARCHAR(100)? Do you need to store non-Latin characters in a person number?
NCHAR takes twice the storage of CHAR, so if the values are alphanumeric (common Latin characters) or digits with leading zeros, VARCHAR/CHAR is likely the better choice; and if a numeric data type can satisfy the requirement, INT/BIGINT/DECIMAL is preferable.
Is there an index on [temp].[RawRoles].Personnumber? Without one, the existence check has to sort or hash [temp].[RawRoles].Personnumber, at an extra cost in resource throughput/DTUs. Since most of temp.RawRoles will end up being processed and inserted, a clustered index on [temp].RawRoles is probably the most beneficial.
What is the data type of [security].[Accounts].Account, and is there an index on that column? [security].[Accounts].Account and [temp].[RawRoles].Personnumber should be the same data type, and ideally both should be indexed. If [security].[Accounts] is the final destination of the processed [temp].[RawRoles] rows, the table will hold millions of rows, and any future processing will need an index on Account. The downside of an index is slower inserts, so if these 2 million rows are the initial bulk load, it may be better to have no index on Account while the bulk is inserted into security.Accounts (and to create it afterwards, as in the summary below).
To summarize:
--contemplate & decide whether a change of the Account data type is needed (a data type change can have many implications for applications using the db)
--change the data type of Personnumber to the datatype of Account(security.Accounts)
ALTER TABLE temp.RawRoles ALTER COLUMN Personnumber "datatype of security.Accounts.Account" NOT NULL; -- rows having undefined Personnumber?
--clustered index Personnumber
CREATE /*UNIQUE*/ CLUSTERED INDEX uclxPersonnumber ON temp.RawRoles(Personnumber); --unique preferred, if possible
--index on account (not needed[?] when security.Accounts is empty)
CREATE INDEX idxAccount ON [security].Accounts(Account);
--baseline, how fast can we do a straight forward insertion of 2 million rows?
DECLARE @tempbaseline AS TABLE(
[action] NVARCHAR(20)
,[GlobalId] BIGINT
,[Personnumber] NVARCHAR(100) --ignore this for now
,[Firstname] NVARCHAR(100)
,[Lastname] NVARCHAR(100)
);
INSERT INTO @tempbaseline([action], GlobalId, Personnumber, Firstname, LastName)
SELECT 'INSERT', 0, t.[Personnumber], t.[Firstname], t.[Lastname]
FROM [temp].[RawRoles] t
WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
--if the execution time (baseline) is acceptable, proceed with the merge code
--"merge with output into" should be be "slightly"/s slower than the baseline.
--if the baseline is not acceptable (simple insertion takes too much time) then merge is futile
/*
DECLARE @temp....
MERGE [security].[Identities] AS tar
USING
(
SELECT --top 1
t.[Personnumber]
,t.[Firstname]
,t.[Lastname]
FROM [temp].[RawRoles] t
WHERE NOT EXISTS (SELECT * FROM [security].[Accounts] i WHERE i.Account = t.Personnumber)
) AS src
ON 0 = 1 -- all rows from src need to be inserted; I've already filtered them out in the USING query.
WHEN NOT MATCHED THEN
INSERT
(
[Created], [Updated]
)
VALUES
(
GETUTCDATE(), GETUTCDATE()
)
OUTPUT 'INSERT' /** only insert is possible $action */, inserted.GlobalId, src.[Personnumber], src.[Firstname], src.[Lastname] INTO @temp;
--delete the index on Account (the process will insert 2mil)
DROP INDEX idxAccount ON [security].Accounts --review and create this index after the bulk of accounts is inserted.
...your process
*/