We run this procedure once every 24 hours. At present, each run takes 10-12 hours on average.
Here is the procedure:
CREATE PROCEDURE [dbo].[SyncDataFromTransfer]
AS
BEGIN
DECLARE @taskSourceId bigint,
@syncBatch uniqueidentifier = newid()
DECLARE curSources cursor
FOR
SELECT TaskSourceId
FROM [Transfer].CaptureSync
GROUP BY TaskSourceId
OPEN curSources
FETCH NEXT FROM curSources
INTO @taskSourceId
WHILE @@fetch_status = 0
BEGIN
-- insert the rows into the capture table, but do not flag as active
INSERT INTO [Data].Capture (
[TaskSourceId]
,[Identifier]
,[IndividualName]
,[EntityName]
,[Text]
,[RawSource]
,[CaptureBatch]
,[IsActive]
,[CaptureDateTime]
,[SyncBatchId]
)
SELECT [TaskSourceId]
,[Identifier]
,[IndividualName]
,[EntityName]
,[Text]
,[RawSource]
,[CaptureBatch]
,0 -- isActive
,[CaptureDateTime]
,@syncBatch -- SyncBatchId
FROM [Transfer].CaptureSync
WHERE TaskSourceId = @taskSourceId
-- flag the new rows as active
UPDATE [Data].Capture
SET IsActive = 1
WHERE TaskSourceId = @taskSourceId
AND SyncBatchId = @syncBatch
-- remove the existing rows
DELETE [Data].Capture
WHERE TaskSourceId = @taskSourceId
AND SyncBatchId != @syncBatch
-- get the next source
FETCH NEXT FROM curSources INTO @taskSourceId
END -- end of the cursor
CLOSE curSources
DEALLOCATE curSources
END
GO
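In short: Data.Capture is our production table, which holds all of our latest records. It is refreshed every 24 hours (by the stored procedure above) to produce a dataset of the most recent information. Records in Data.Capture are grouped by TaskSourceId. One TaskSourceId has many records from the same web source.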
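Once every 24 hours, a web scraper writes data into the Transfer.CaptureSync table, which acts as a holding table. The purpose of the stored procedure is to walk through Transfer.CaptureSync one TaskSourceId at a time and replace the group of records for that TaskSourceId in Data.Capture, so that Data.Capture always holds the latest information for a given TaskSourceId.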
However, not every TaskSourceId receives new records every day. When a given TaskSourceId has no new records on a given day, we simply want to leave the latest records that are already in Data.Capture.
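I hope this explanation makes sense. To recap:
Get the latest information from the websites and write it to Transfer.CaptureSync.
For each TaskSourceId, replace the information in Data.Capture with the information in Transfer.CaptureSync.
If Transfer.CaptureSync has no set of records for a given TaskSourceId, leave the last set of records for that TaskSourceId in Data.Capture unchanged.
After the process completes each day, the Transfer.CaptureSync table is truncated.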
The Data.Capture table has about 4 million records, and the Transfer.CaptureSync table receives about 3 million records per day.
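Given all of this, back to the problem at hand. The procedure runs for 10-12 hours a day on average and ties up valuable resources for that whole time.
Is this the most efficient and optimal way to achieve this? I realize that "most efficient and optimal" is somewhat subjective. I am looking for input from SQL experts, which I am not.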
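Answer 0 (score: 1)
Have you looked at partitioned tables?
If you partition both the Transfer.CaptureSync table and the Data.Capture table by TaskSourceId, you should be able to switch partitions out and in. With the partition switches wrapped in a transaction, the transfer should drop from hours to seconds while maintaining data integrity.
On further thought:
This assumes the tables have the same structure. If they do not, you can use a partitioned staging table with the same structure as Data.Capture. Staging the data should not take much time, because all you are doing is copying rows, without any specific updates or deletes.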
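The staging copy described above could look roughly like the following. This is only a sketch: it assumes a [Staging].Capture table with the same columns as [Data].Capture (like the one created in this answer's demo script), it takes the column list from the procedure in the question, and the TRUNCATE and the TABLOCK hint are my own additions rather than anything the question requires.

```sql
-- Sketch: copy the day's scrape into the partitioned staging table.
-- Assumption: [Staging].Capture mirrors [Data].Capture column for column.
-- Note: the demo table declares TaskSourceId as INT; widen it if the real
-- column is BIGINT, as the variable in the original procedure suggests.
DECLARE @syncBatch UNIQUEIDENTIFIER = NEWID();

-- Start from empty partitions so leftovers from an earlier run cannot be switched in.
TRUNCATE TABLE [Staging].Capture;

INSERT INTO [Staging].Capture WITH (TABLOCK)
    ( [TaskSourceId], [Identifier], [IndividualName], [EntityName], [Text],
      [RawSource], [CaptureBatch], [IsActive], [CaptureDateTime], [SyncBatchId] )
SELECT
     [TaskSourceId]
    ,[Identifier]
    ,[IndividualName]
    ,[EntityName]
    ,[Text]
    ,[RawSource]
    ,[CaptureBatch]
    ,1               -- IsActive: rows only become visible once switched in
    ,[CaptureDateTime]
    ,@syncBatch      -- SyncBatchId
FROM [Transfer].CaptureSync;
```

Because this is a plain insert with no per-source loop, it touches each incoming row once; the expensive replace step is then left to the partition switch.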
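Here is an example of the partitioning discussed (make sure you do not run this in production, or anywhere you have an important database named "Playground" :)):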
USE [master];
GO
-- Just Creating a dummy database here
IF EXISTS ( SELECT * FROM sys.databases WHERE name = 'PlayGround' )
BEGIN
ALTER DATABASE [Playground] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DROP DATABASE [Playground];
END;
GO
CREATE DATABASE Playground;
GO
USE [Playground];
GO
-- Creating the Data Table with required partitions
CREATE SCHEMA [Data] AUTHORIZATION dbo;
GO
CREATE TABLE [Data].Capture
(
[TaskSourceId] INT
,[Identifier] BIGINT
,[IndividualName] VARCHAR(255)
,[EntityName] VARCHAR(255)
,[Text] VARCHAR(400)
,[RawSource] VARCHAR(200)
,[CaptureBatch] UNIQUEIDENTIFIER
,[IsActive] BIT
,[CaptureDateTime] DATETIME2(7)
,[SyncBatchId] UNIQUEIDENTIFIER
);
CREATE PARTITION FUNCTION PF_DataCapture ( INT )
AS RANGE RIGHT FOR VALUES ( 1, 2, 3, 4, 5 );
CREATE PARTITION SCHEME PS_DataCapture
AS PARTITION PF_DataCapture
TO ( [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY] );
CREATE CLUSTERED INDEX IXC_DataCapture_TaskSourceID
ON [Data].Capture ( TaskSourceId )
ON PS_DataCapture([TaskSourceId]);
GO
-- Creating the Staging Table with required partitions
CREATE SCHEMA Staging AUTHORIZATION DBO;
GO
CREATE TABLE [Staging].Capture
(
[TaskSourceId] INT
,[Identifier] BIGINT
,[IndividualName] VARCHAR(255)
,[EntityName] VARCHAR(255)
,[Text] VARCHAR(400)
,[RawSource] VARCHAR(200)
,[CaptureBatch] UNIQUEIDENTIFIER
,[IsActive] BIT
,[CaptureDateTime] DATETIME2(7)
,[SyncBatchId] UNIQUEIDENTIFIER
);
CREATE PARTITION FUNCTION PF_StagingCapture ( INT )
AS RANGE RIGHT FOR VALUES ( 1, 2, 3, 4, 5 );
CREATE PARTITION SCHEME PS_StagingCapture
AS PARTITION PF_StagingCapture
TO ( [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY] );
CREATE CLUSTERED INDEX IXC_StagingCapture_TaskSourceID
ON [Staging].Capture ( TaskSourceId )
ON PS_StagingCapture([TaskSourceId]);
GO
-- Creating an archive table with required partitions just so that we can easily transfer out of the data table
CREATE SCHEMA Archive AUTHORIZATION DBO;
GO
CREATE TABLE [Archive].Capture
(
[TaskSourceId] INT
,[Identifier] BIGINT
,[IndividualName] VARCHAR(255)
,[EntityName] VARCHAR(255)
,[Text] VARCHAR(400)
,[RawSource] VARCHAR(200)
,[CaptureBatch] UNIQUEIDENTIFIER
,[IsActive] BIT
,[CaptureDateTime] DATETIME2(7)
,[SyncBatchId] UNIQUEIDENTIFIER
);
CREATE PARTITION FUNCTION PF_ArchiveCapture ( INT )
AS RANGE RIGHT FOR VALUES ( 1, 2, 3, 4, 5 );
CREATE PARTITION SCHEME PS_ArchiveCapture
AS PARTITION PF_ArchiveCapture
TO ( [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY], [PRIMARY] );
CREATE CLUSTERED INDEX IXC_ArchiveCapture_TaskSourceID
ON [Archive].Capture ( TaskSourceId )
ON PS_ArchiveCapture([TaskSourceId]);
GO
--Lets insert some data into the staging table (this can be your population from Transfer.CaptureSync)
DECLARE @SyncBatchId UNIQUEIDENTIFIER = NEWID();
INSERT INTO Staging.[Capture]
(
[TaskSourceId]
,[Identifier]
,[IndividualName]
,[EntityName]
,[Text]
,[RawSource]
,[CaptureBatch]
,[IsActive]
,[CaptureDateTime]
,[SyncBatchId]
)
VALUES
(
1 -- TaskSourceId - int
,1 -- Identifier - bigint
,'Insert Name Here' -- IndividualName - varchar(255)
,'Insert EntityName Here' -- EntityName - varchar(255)
,'Insert Text Here' -- Text - varchar(400)
,'Insert RawSource Here' -- RawSource - varchar(200)
,NEWID() -- CaptureBatch - uniqueidentifier
,1 -- IsActive - bit
,SYSDATETIME() -- CaptureDateTime - datetime2(7)
,@SyncBatchId -- SyncBatchId - uniqueidentifier
)
,(
3 -- TaskSourceId - int
,4 -- Identifier - bigint
,'Insert Name Here' -- IndividualName - varchar(255)
,'Insert EntityName Here' -- EntityName - varchar(255)
,'Insert Text Here' -- Text - varchar(400)
,'Insert RawSource Here' -- RawSource - varchar(200)
,NEWID() -- CaptureBatch - uniqueidentifier
,1 -- IsActive - bit
,SYSDATETIME() -- CaptureDateTime - datetime2(7)
,@SyncBatchId -- SyncBatchId - uniqueidentifier
)
,(
1 -- TaskSourceId - int
,3 -- Identifier - bigint
,'Insert Name Here' -- IndividualName - varchar(255)
,'Insert EntityName Here' -- EntityName - varchar(255)
,'Insert Text Here' -- Text - varchar(400)
,'Insert RawSource Here' -- RawSource - varchar(200)
,NEWID() -- CaptureBatch - uniqueidentifier
,1 -- IsActive - bit
,SYSDATETIME() -- CaptureDateTime - datetime2(7)
,@SyncBatchId -- SyncBatchId - uniqueidentifier
);
GO
CREATE PROCEDURE TransferSync
AS
BEGIN
DECLARE @TaskSourceId INT;
DECLARE @PartitionNo INT;
DECLARE @SwitchPartitionSQL VARCHAR(4000);
DECLARE curSources CURSOR FOR
SELECT
TaskSourceId
FROM
[Staging].Capture
GROUP BY
TaskSourceId;
TRUNCATE TABLE [Archive].[Capture]; -- We need the partitions to be empty on this table
OPEN curSources;
FETCH NEXT FROM curSources
INTO
@TaskSourceId;
WHILE @@Fetch_Status = 0
BEGIN
-- Finding the partition number the data for our @TaskSourceId is in
SELECT
--OBJECT_NAME(p.object_id) AS TableName
--,s.name AS SchemaName
--,i.name AS IndexName
--,p.index_id AS IndexID
--,ds.name AS PartitionScheme
@PartitionNo = p.partition_number --AS PartitionNumber
--,fg.name AS FileGroupName
--,prv_left.value AS LowerBoundaryValue
--,prv_right.value AS UpperBoundaryValue
--,CASE pf.boundary_value_on_right WHEN 1 THEN 'RIGHT' ELSE 'LEFT' END AS Range
--,p.rows AS Rows
FROM
sys.partitions AS p
JOIN sys.objects AS o
ON p.object_id = o.object_id
JOIN sys.indexes AS i
ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.schemas AS s
ON s.schema_id = o.schema_id
JOIN sys.data_spaces AS ds
ON ds.data_space_id = i.data_space_id
JOIN sys.partition_schemes AS ps
ON ps.data_space_id = ds.data_space_id
JOIN sys.partition_functions AS pf
ON pf.function_id = ps.function_id
JOIN sys.destination_data_spaces AS dds2
ON dds2.partition_scheme_id = ps.data_space_id AND dds2.destination_id = p.partition_number
JOIN sys.filegroups AS fg
ON fg.data_space_id = dds2.data_space_id
LEFT JOIN sys.partition_range_values AS prv_left
ON ps.function_id = prv_left.function_id AND prv_left.boundary_id = p.partition_number - 1
LEFT JOIN sys.partition_range_values AS prv_right
ON ps.function_id = prv_right.function_id AND prv_right.boundary_id = p.partition_number
WHERE OBJECTPROPERTY(p.object_id, 'ISMSShipped') = 0
AND OBJECT_NAME(p.object_id) = 'Capture'
AND SCHEMA_NAME(o.schema_id) = 'Data'
AND [prv_left].[value] = @TaskSourceId;
SELECT
@SwitchPartitionSQL = '
BEGIN TRAN;
ALTER TABLE Data.[Capture] SWITCH PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
+ ' TO [Archive].[Capture] PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
+ '
ALTER TABLE Staging.[Capture] SWITCH PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
+ ' TO [Data].[Capture] PARTITION ' + CONVERT(VARCHAR(10), @PartitionNo)
+ '
COMMIT TRAN;
' ;
-- Partition switching magic here :)
--PRINT @SwitchPartitionSQL;
EXEC ( @SwitchPartitionSQL );
FETCH NEXT FROM curSources
INTO
@TaskSourceId;
END;
CLOSE curSources;
DEALLOCATE curSources;
END;
GO
EXEC [dbo].[TransferSync];
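As a possible simplification of the partition lookup inside the procedure, SQL Server's built-in $PARTITION function can map a key to its partition number directly. The sketch below assumes the PF_DataCapture function and the [Data].Capture table from the demo script above; the example key 3 is arbitrary.

```sql
-- Sketch: ask the partition function which partition a key belongs to,
-- instead of joining the catalog views.
DECLARE @TaskSourceId INT = 3;   -- example key, chosen arbitrarily
DECLARE @PartitionNo  INT;

SELECT @PartitionNo = $PARTITION.PF_DataCapture(@TaskSourceId);

-- With RANGE RIGHT and boundaries (1, 2, 3, 4, 5), key 3 maps to partition 4.
SELECT @PartitionNo AS PartitionNo;

-- Row counts per partition, useful for checking the result of a switch.
SELECT
     $PARTITION.PF_DataCapture(TaskSourceId) AS PartitionNo
    ,COUNT(*)                                AS RowsInPartition
FROM [Data].Capture
GROUP BY $PARTITION.PF_DataCapture(TaskSourceId)
ORDER BY PartitionNo;
```

Unlike the catalog query, which matches only keys that are exactly equal to a boundary value, $PARTITION also returns a partition for keys that fall between or beyond the boundaries.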
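I have only used three rows here, but the switch is a metadata operation and is close to instantaneous. The process takes roughly the same time whether the staging table holds 3 rows or 1 million rows.
Answer 1 (score: 1)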
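[CaptureSync] is truncated every day, so remove the CURSOR and INSERT into [Capture] straight from [CaptureSync], matching on TaskSourceId. This only affects rows whose TaskSourceId exists in [CaptureSync].
You could keep the UPDATE, but if you ask for the business reason why IsActive is set to 0 when the new rows are inserted, you probably will not find one (I cannot see a technical reason either). In that case, change the INSERT to set IsActive to 1.
The DELETE should be changed to take the [CaptureSync] rows into account on TaskSourceId.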
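-- Remove the previous batch, but only for sources that received new rows today
DELETE C
FROM [Data].Capture AS C
INNER JOIN [Transfer].CaptureSync AS D
    ON D.TaskSourceId = C.TaskSourceId
WHERE C.SyncBatchId <> @SyncBatch;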
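Put together, the cursor-free approach described in this answer might look like the sketch below. It uses the table and column names from the question. The EXISTS test (in place of the join), the single transaction, and the TRY/CATCH wrapper are my own choices and not part of the answer. THROW requires SQL Server 2012 or later.

```sql
-- Sketch: replace every source that has new rows in one set-based pass.
DECLARE @syncBatch UNIQUEIDENTIFIER = NEWID();

BEGIN TRY
    BEGIN TRANSACTION;

    -- Insert all of today's rows as active under one batch id.
    INSERT INTO [Data].Capture
        ( [TaskSourceId], [Identifier], [IndividualName], [EntityName], [Text],
          [RawSource], [CaptureBatch], [IsActive], [CaptureDateTime], [SyncBatchId] )
    SELECT
        [TaskSourceId], [Identifier], [IndividualName], [EntityName], [Text],
        [RawSource], [CaptureBatch], 1, [CaptureDateTime], @syncBatch
    FROM [Transfer].CaptureSync;

    -- Delete older rows only for sources present in today's load;
    -- sources without new rows keep their last set of records.
    DELETE C
    FROM [Data].Capture AS C
    WHERE C.SyncBatchId <> @syncBatch
      AND EXISTS ( SELECT 1
                   FROM [Transfer].CaptureSync AS S
                   WHERE S.TaskSourceId = C.TaskSourceId );

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH;
```

Running both statements in one transaction means readers never see a source with its old rows removed and no replacements, at the cost of a larger transaction log for the day's load. An index on TaskSourceId in both tables would likely help the DELETE, though that is an assumption about the schema rather than something stated in the question.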