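SQL Server - Optimizing and improving the efficiency of this procedure

Time: 2018-06-16 15:36:09

Tags: sql-server stored-procedures azure-sql-database sql-server-2016

Azure SQL Server 2016 w/ Azure SQL Database:

We run this procedure once every 24 hours. Currently, each run takes 10-12 hours on average.

Here is the procedure: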

CREATE PROCEDURE [dbo].[SyncDataFromTransfer]
AS
BEGIN

DECLARE @taskSourceId bigint,
        @syncBatch uniqueidentifier = newid()

DECLARE curSources cursor
FOR
SELECT  TaskSourceId
FROM    [Transfer].CaptureSync
GROUP BY TaskSourceId

OPEN curSources

FETCH NEXT FROM curSources
INTO @taskSourceId

WHILE @@fetch_status = 0
BEGIN

-- insert the rows into the capture table, but do not flag as active
INSERT INTO [Data].Capture (
     [TaskSourceId]
    ,[Identifier]
    ,[IndividualName]
    ,[EntityName]
    ,[Text]
    ,[RawSource]
    ,[CaptureBatch]
    ,[IsActive]
    ,[CaptureDateTime]
    ,[SyncBatchId]
    )
SELECT  [TaskSourceId]
    ,[Identifier]
    ,[IndividualName]
    ,[EntityName]
    ,[Text]
    ,[RawSource]
    ,[CaptureBatch]
    ,0 -- isActive
    ,[CaptureDateTime]
    ,@syncBatch -- SyncBatchId

FROM    [Transfer].CaptureSync
WHERE   TaskSourceId = @taskSourceId

-- flag the new rows as active
UPDATE [Data].Capture
SET IsActive = 1
WHERE TaskSourceId = @taskSourceId
  AND SyncBatchId = @syncBatch

-- remove the existing rows
DELETE [Data].Capture
WHERE TaskSourceId = @taskSourceId
  AND SyncBatchId != @syncBatch

-- get the next source
FETCH NEXT FROM curSources INTO @taskSourceId

END -- end of the cursor

CLOSE curSources
DEALLOCATE curSources

END
GO

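In short: Data.Capture is our production table, which contains all of our latest records. This table is refreshed once every 24 hours (by the stored procedure above) to create a data set of the most recent information. Records in Data.Capture are grouped by TaskSourceId. One TaskSourceId has many records, all from the same web source.

Once every 24 hours, a web scraper writes data to the Transfer.CaptureSync table, which acts as a holding table. The purpose of this stored procedure is to then go through Transfer.CaptureSync one TaskSourceId at a time and replace the group of records for that TaskSourceId in Data.Capture, so that Data.Capture always holds the latest information for a given TaskSourceId.

However, not every TaskSourceId has new records every day. When there are no new records for a given TaskSourceId on a given day, we simply want to leave the latest records already in Data.Capture in place.

I hope this explanation makes sense. To recap: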

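  1. Fetch the latest information from the websites and write it to Transfer.CaptureSync.

  2. For each TaskSourceId, replace the information in Data.Capture with the information from Transfer.CaptureSync.

  3. If Transfer.CaptureSync contains no records for a given TaskSourceId, leave the last set of records for that TaskSourceId in Data.Capture unchanged.

  4. After the procedure completes each day, the Transfer.CaptureSync table is truncated.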

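The Transfer.CaptureSync table has about 4 million records, and the Foo table has about 3 million records per day.

Given all of this information, back to the question at hand. This procedure runs for 10-12 hours per day on average, tying up valuable resources for a long time.

Is this the most efficient and optimal way to accomplish this? I realize that "most efficient and optimal" is somewhat subjective. I'm looking for input from SQL experts, which I am not.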

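2 answers:

Answer 0: (score: 1)

Have you looked at partitioned tables?

If you partition both the Transfer.CaptureSync table and the Data.Capture table by TaskSourceId, you should be able to switch partitions in and out. With the partition switch wrapped in a transaction, the transfer should drop from hours to seconds while maintaining data integrity.

On further thought:

This assumes the tables have the same structure. If they do not, you can have a partitioned staging table with the same structure as the Data.Capture table. Staging the data should not take much time, since all you are doing is copying data without any specific updates or deletes.

Here is an example of the partitioning discussed (make sure you don't run this in production, or anywhere you have an important database named "Playground") :):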

USE [master];
GO
-- Just Creating a dummy database here
IF EXISTS ( SELECT * FROM sys.databases WHERE name = 'Playground' )
BEGIN
    ALTER DATABASE [Playground] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    DROP DATABASE [Playground];
END;
GO

CREATE DATABASE Playground;
GO

USE [Playground];
GO
-- Creating the Data Table with required partitions
CREATE SCHEMA [Data] AUTHORIZATION dbo;
GO

CREATE TABLE [Data].Capture
(
    [TaskSourceId]     INT
    ,[Identifier]      BIGINT
    ,[IndividualName]  VARCHAR(255)
    ,[EntityName]      VARCHAR(255)
    ,[Text]            VARCHAR(400)
    ,[RawSource]       VARCHAR(200)
    ,[CaptureBatch]    UNIQUEIDENTIFIER
    ,[IsActive]        BIT
    ,[CaptureDateTime] DATETIME2(7)
    ,[SyncBatchId]     UNIQUEIDENTIFIER
);

CREATE PARTITION FUNCTION PF_DataCapture ( INT )
AS RANGE RIGHT FOR VALUES ( 1, 2, 3, 4, 5 );

CREATE PARTITION SCHEME PS_DataCapture
AS PARTITION PF_DataCapture
TO ( [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY] );

CREATE CLUSTERED INDEX IXC_DataCapture_TaskSourceID
ON [Data].Capture ( TaskSourceId )
ON PS_DataCapture([TaskSourceId]);
GO

-- Creating the Staging Table with required partitions
CREATE SCHEMA Staging AUTHORIZATION DBO;
GO

CREATE TABLE [Staging].Capture
(
    [TaskSourceId]     INT
    ,[Identifier]      BIGINT
    ,[IndividualName]  VARCHAR(255)
    ,[EntityName]      VARCHAR(255)
    ,[Text]            VARCHAR(400)
    ,[RawSource]       VARCHAR(200)
    ,[CaptureBatch]    UNIQUEIDENTIFIER
    ,[IsActive]        BIT
    ,[CaptureDateTime] DATETIME2(7)
    ,[SyncBatchId]     UNIQUEIDENTIFIER
);

CREATE PARTITION FUNCTION PF_StagingCapture ( INT )
AS RANGE RIGHT FOR VALUES ( 1, 2, 3, 4, 5 );

CREATE PARTITION SCHEME PS_StagingCapture
AS PARTITION PF_StagingCapture
TO ( [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY] );

CREATE CLUSTERED INDEX IXC_StagingCapture_TaskSourceID
ON [Staging].Capture ( TaskSourceId )
ON PS_StagingCapture([TaskSourceId]);
GO


-- Creating an archive table with required partitions just so that we can easily transfer out of the data table
CREATE SCHEMA Archive AUTHORIZATION DBO;
GO

CREATE TABLE [Archive].Capture
(
    [TaskSourceId]     INT
    ,[Identifier]      BIGINT
    ,[IndividualName]  VARCHAR(255)
    ,[EntityName]      VARCHAR(255)
    ,[Text]            VARCHAR(400)
    ,[RawSource]       VARCHAR(200)
    ,[CaptureBatch]    UNIQUEIDENTIFIER
    ,[IsActive]        BIT
    ,[CaptureDateTime] DATETIME2(7)
    ,[SyncBatchId]     UNIQUEIDENTIFIER
);

CREATE PARTITION FUNCTION PF_ArchiveCapture ( INT )
AS RANGE RIGHT FOR VALUES ( 1, 2, 3, 4, 5 );

CREATE PARTITION SCHEME PS_ArchiveCapture
AS PARTITION PF_ArchiveCapture
TO ( [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY] );

CREATE CLUSTERED INDEX IXC_ArchiveCapture_TaskSourceID
ON [Archive].Capture ( TaskSourceId )
ON PS_ArchiveCapture([TaskSourceId]);
GO


--Lets insert some data into the staging table (this can be your population from Transfer.CaptureSync) 
DECLARE @SyncBatchId UNIQUEIDENTIFIER = NEWID();

INSERT INTO Staging.[Capture]
(
    [TaskSourceId]
    ,[Identifier]
    ,[IndividualName]
    ,[EntityName]
    ,[Text]
    ,[RawSource]
    ,[CaptureBatch]
    ,[IsActive]
    ,[CaptureDateTime]
    ,[SyncBatchId]
)
VALUES
(
    1                         -- TaskSourceId - int
    ,1                        -- Identifier - bigint
    ,'Insert Name Here'       -- IndividualName - varchar(255)
    ,'Insert EntityName Here' -- EntityName - varchar(255)
    ,'Insert Text Here'       -- Text - varchar(400)
    ,'Insert RawSource Here'  -- RawSource - varchar(200)
    ,NEWID()                  -- CaptureBatch - uniqueidentifier
    ,1                        -- IsActive - bit
    ,SYSDATETIME()            -- CaptureDateTime - datetime2(7)
    ,@SyncBatchId             -- SyncBatchId - uniqueidentifier
)
,(
     3                         -- TaskSourceId - int
     ,4                        -- Identifier - bigint
     ,'Insert Name Here'       -- IndividualName - varchar(255)
     ,'Insert EntityName Here' -- EntityName - varchar(255)
     ,'Insert Text Here'       -- Text - varchar(400)
     ,'Insert RawSource Here'  -- RawSource - varchar(200)
     ,NEWID()                  -- CaptureBatch - uniqueidentifier
     ,1                        -- IsActive - bit
     ,SYSDATETIME()            -- CaptureDateTime - datetime2(7)
     ,@SyncBatchId             -- SyncBatchId - uniqueidentifier
 )
,(
     1                         -- TaskSourceId - int
     ,3                        -- Identifier - bigint
     ,'Insert Name Here'       -- IndividualName - varchar(255)
     ,'Insert EntityName Here' -- EntityName - varchar(255)
     ,'Insert Text Here'       -- Text - varchar(400)
     ,'Insert RawSource Here'  -- RawSource - varchar(200)
     ,NEWID()                  -- CaptureBatch - uniqueidentifier
     ,1                        -- IsActive - bit
     ,SYSDATETIME()            -- CaptureDateTime - datetime2(7)
     ,@SyncBatchId             -- SyncBatchId - uniqueidentifier
 );
GO

CREATE PROCEDURE TransferSync
AS
BEGIN
    DECLARE @TaskSourceId INT;
    DECLARE @PartitionNo INT;
    DECLARE @SwitchPartitionSQL VARCHAR(4000);

    DECLARE curSources CURSOR FOR
    SELECT
        TaskSourceId
    FROM
        [Staging].Capture
    GROUP BY
        TaskSourceId;

    TRUNCATE TABLE [Archive].[Capture]; -- We need the partitions to be empty on this table

    OPEN curSources;

    FETCH NEXT FROM curSources
    INTO
        @TaskSourceId;

    WHILE @@Fetch_Status = 0
    BEGIN
        -- Finding the partition number the data for our @TaskSourceId is in
        SELECT
            --OBJECT_NAME(p.object_id)                                             AS TableName
            --,s.name                                                              AS SchemaName
            --,i.name                                                              AS IndexName
            --,p.index_id                                                          AS IndexID
            --,ds.name                                                             AS PartitionScheme
            @PartitionNo = p.partition_number --AS PartitionNumber
        --,fg.name                                                             AS FileGroupName
        --,prv_left.value                                                      AS LowerBoundaryValue
        --,prv_right.value                                                     AS UpperBoundaryValue
        --,CASE pf.boundary_value_on_right WHEN 1 THEN 'RIGHT' ELSE 'LEFT' END AS Range
        --,p.rows                                                              AS Rows
        FROM
            sys.partitions                       AS p
            JOIN sys.objects                     AS o
                ON p.object_id              = o.object_id
            JOIN sys.indexes                     AS i
                ON i.object_id              = p.object_id AND i.index_id = p.index_id
            JOIN sys.schemas                     AS s
                ON s.schema_id              = o.schema_id
            JOIN sys.data_spaces                 AS ds
                ON ds.data_space_id         = i.data_space_id
            JOIN sys.partition_schemes           AS ps
                ON ps.data_space_id         = ds.data_space_id
            JOIN sys.partition_functions         AS pf
                ON pf.function_id           = ps.function_id
            JOIN sys.destination_data_spaces     AS dds2
                ON dds2.partition_scheme_id = ps.data_space_id AND dds2.destination_id = p.partition_number
            JOIN sys.filegroups                  AS fg
                ON fg.data_space_id         = dds2.data_space_id
            LEFT JOIN sys.partition_range_values AS prv_left
                ON ps.function_id           = prv_left.function_id AND prv_left.boundary_id = p.partition_number - 1
            LEFT JOIN sys.partition_range_values AS prv_right
                ON ps.function_id           = prv_right.function_id AND prv_right.boundary_id = p.partition_number
        WHERE OBJECTPROPERTY(p.object_id, 'ISMSShipped') = 0
              AND OBJECT_NAME(p.object_id)               = 'Capture'
              AND SCHEMA_NAME(o.schema_id)               = 'Data'
              AND [prv_left].[value]                     = @TaskSourceId;

        SELECT
            @SwitchPartitionSQL = '
    BEGIN TRAN;
        ALTER TABLE Data.[Capture] SWITCH PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
                                  + ' TO [Archive].[Capture] PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
                                  + '
        ALTER TABLE Staging.[Capture] SWITCH PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
                                  + ' TO [Data].[Capture] PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
                                  + ' 
    COMMIT TRAN;
    '   ;
        -- Partition switching magic here :)
        --PRINT @SwitchPartitionSQL;

        EXEC ( @SwitchPartitionSQL );

        FETCH NEXT FROM curSources
        INTO
            @TaskSourceId;
    END;

    CLOSE curSources;
    DEALLOCATE curSources;
END;
GO

EXEC [dbo].[TransferSync];

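I've only used three rows here, but the switch is a metadata operation and is nearly instantaneous. Whether the Staging table has 3 rows or 1 million rows, the process takes about the same time.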
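
The catalog-view query inside TransferSync can probably be shortened. As a rough sketch, assuming the PF_DataCapture and PF_StagingCapture partition functions from the example above exist in the current database, the built-in $PARTITION function returns the partition number for a value directly. It also handles values that do not fall exactly on a boundary, which the `prv_left.value = @TaskSourceId` filter would miss.

```sql
-- Sketch: look up the partition number with $PARTITION instead of joining
-- the catalog views. Assumes PF_DataCapture from the example above.
DECLARE @TaskSourceId INT = 3;
DECLARE @PartitionNo  INT;

SELECT @PartitionNo = $PARTITION.PF_DataCapture(@TaskSourceId);

SELECT @PartitionNo AS PartitionNo;

-- Optional sanity check before switching: row counts per staging partition.
SELECT
    $PARTITION.PF_StagingCapture(TaskSourceId) AS PartitionNo
    ,COUNT(*)                                  AS RowsInPartition
FROM
    [Staging].Capture
GROUP BY
    $PARTITION.PF_StagingCapture(TaskSourceId);
```

Because all three tables in the example use the same boundary values, the partition number found for Data.Capture should also apply to the Staging and Archive tables.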

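Answer 1: (score: 1)

[CaptureSync] is truncated daily, so remove the CURSOR and INSERT into [Capture] by joining to [CaptureSync] on TaskSourceId. This will only affect rows whose TaskSourceId exists in [CaptureSync].

You can leave the UPDATE in, but if you ask for the business reason why IsActive is set to 0 when new rows are inserted, you probably won't find one (I can't see a technical reason). In that case, modify the INSERT to set IsActive to 1.

The DELETE should be changed to take the [CaptureSync] rows into account on TaskSourceId.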

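-- Remove old rows only for sources that have new rows in the transfer table
DELETE C
FROM [Data].Capture AS C
INNER JOIN [Transfer].CaptureSync AS S
    ON S.TaskSourceId = C.TaskSourceId
WHERE C.SyncBatchId <> @SyncBatch;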
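
Putting the pieces of this answer together, a cursor-free version of the procedure might look roughly like the sketch below. The procedure name is hypothetical, and inserting with IsActive = 1 and wrapping both statements in a single transaction are assumptions rather than anything confirmed by the question. Test it against a copy of the data first.

```sql
-- Sketch only: a set-based variant of dbo.SyncDataFromTransfer.
-- The name SyncDataFromTransfer_SetBased is hypothetical.
CREATE PROCEDURE [dbo].[SyncDataFromTransfer_SetBased]
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON; -- roll the whole swap back on any error

    DECLARE @syncBatch uniqueidentifier = NEWID();

    BEGIN TRANSACTION;

    -- Insert every transferred row in one statement, already flagged active
    -- (assumes there is no business reason for the 0-then-1 step).
    INSERT INTO [Data].Capture (
         [TaskSourceId]
        ,[Identifier]
        ,[IndividualName]
        ,[EntityName]
        ,[Text]
        ,[RawSource]
        ,[CaptureBatch]
        ,[IsActive]
        ,[CaptureDateTime]
        ,[SyncBatchId]
        )
    SELECT  [TaskSourceId]
        ,[Identifier]
        ,[IndividualName]
        ,[EntityName]
        ,[Text]
        ,[RawSource]
        ,[CaptureBatch]
        ,1 -- IsActive
        ,[CaptureDateTime]
        ,@syncBatch -- SyncBatchId
    FROM    [Transfer].CaptureSync;

    -- Remove older rows, but only for sources that received new rows today;
    -- sources absent from the transfer table keep their existing rows.
    DELETE C
    FROM [Data].Capture AS C
    WHERE C.SyncBatchId <> @syncBatch
      AND EXISTS (
            SELECT 1
            FROM [Transfer].CaptureSync AS S
            WHERE S.TaskSourceId = C.TaskSourceId
          );

    COMMIT TRANSACTION;
END;
GO
```

EXISTS is used here instead of a join so that each Data.Capture row is checked once rather than matched against every transfer row for its source. One transaction over several million rows can generate a lot of log activity, so if that becomes a problem, batching by TaskSourceId ranges is a possible middle ground.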