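I import a variety of data from different sources (JSON files, other databases, and REST APIs), and I need to deduplicate it. First I load everything into a single table that records each row's type and stores the data as JSON, so that later, when I run the batch processing, I can look up the type and insert the data into the proper table. The number of imported rows varies (each type goes to a different table or set of tables), but it is always above 1 million (about 10 GB in total if I put everything in a single table in JSON format using VARCHAR(MAX)).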
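To make the staging step concrete, here is a minimal sketch of such a table. The table name, column names, and the 'ProductDetails' type value are assumptions for illustration; the original post does not show the actual schema.

```sql
-- Hypothetical staging table: all names and types here are illustrative assumptions.
CREATE TABLE dbo.ImportStaging
(
    Id       BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DataType VARCHAR(50)  NOT NULL,  -- decides which destination table(s) the row goes to
    DataJson VARCHAR(MAX) NOT NULL   -- raw payload stored as JSON text
);

-- A later batch run can pick up one type at a time.
SELECT Id, DataJson
FROM dbo.ImportStaging
WHERE DataType = 'ProductDetails';
```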
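As I mentioned, I need to handle duplicates, so I tried defining unique indexes on the destination tables with Ignore Duplicate Keys enabled, which only raises a warning when I insert data that already exists. The problem is that this works in only a few cases. Most of the time the key consists of five or more varchar(255) fields, and I cannot put them all in a unique index because of the index key size limit (900 bytes).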
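One commonly used way around the key size limit, which is not something tried in the original post, is to index a fixed-size hash of the wide columns instead of the columns themselves. The sketch below assumes a hypothetical table `dbo.ImportedItems` with columns `Field1` to `Field5`, and assumes the `|` separator never occurs in the data. As a side note, on SQL Server 2016 and later the nonclustered index key limit is 1700 bytes, so five varchar(255) columns may fit there directly.

```sql
-- Hypothetical destination table with five wide key columns.
CREATE TABLE dbo.ImportedItems
(
    Id     BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Field1 VARCHAR(255) NOT NULL,
    Field2 VARCHAR(255) NOT NULL,
    Field3 VARCHAR(255) NOT NULL,
    Field4 VARCHAR(255) NOT NULL,
    Field5 VARCHAR(255) NOT NULL,
    -- 32-byte SHA-256 digest of the key columns. The '|' separator keeps
    -- ('ab', 'c') and ('a', 'bc') from producing the same input string.
    KeyHash AS CAST(HASHBYTES('SHA2_256',
                    CONCAT(Field1, '|', Field2, '|', Field3, '|', Field4, '|', Field5))
               AS BINARY(32)) PERSISTED
);

-- Unique index on the hash; duplicate keys are skipped with a warning.
CREATE UNIQUE NONCLUSTERED INDEX UX_ImportedItems_KeyHash
    ON dbo.ImportedItems (KeyHash)
    WITH (IGNORE_DUP_KEY = ON);
```

The trade-off is that uniqueness is enforced on the digest rather than on the values, so this relies on SHA-256 collisions being negligible for the data at hand.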
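The other thing I am struggling with is that, during the bulk insert, I need to insert relational data, meaning one table has a foreign key to another. So I have to handle the dependencies first and, once I have their inserted IDs, use those IDs to insert the dependent data. For example, a product has a manufacturer: first I insert all manufacturer names in the current batch, then I use their IDs to insert the products.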
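The manufacturer-then-product ordering described above can be sketched in set-based form as follows. All three tables are hypothetical and stand in for the real schema; the sketch also assumes a single session runs the batch, since it does nothing to guard against concurrent writers.

```sql
-- Hypothetical tables, for illustration only:
--   dbo.StagedProducts(ManufacturerName, ProductName)  rows of the current batch
--   dbo.Manufacturers(Id IDENTITY, Name)
--   dbo.Products(Id IDENTITY, ManufacturerId, Name)

-- Step 1: insert the manufacturer names that are not present yet.
INSERT INTO dbo.Manufacturers (Name)
SELECT DISTINCT s.ManufacturerName
FROM dbo.StagedProducts AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Manufacturers AS m
                  WHERE m.Name = s.ManufacturerName);

-- Step 2: resolve manufacturer IDs by name and insert the products that are new.
INSERT INTO dbo.Products (ManufacturerId, Name)
SELECT DISTINCT m.Id, s.ProductName
FROM dbo.StagedProducts AS s
JOIN dbo.Manufacturers AS m
    ON m.Name = s.ManufacturerName
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Products AS p
                  WHERE p.ManufacturerId = m.Id
                    AND p.Name = s.ProductName);
```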
Needing both the inserted IDs and deduplication is what shapes the query I am trying to write.
First, I tried to handle this with a stored procedure like this:
CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT @NewId = [Id] FROM PDProductDetails
    WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
          ManufacturerId = @ManufacturerNameId AND
          ManufacturerReference = @ManufacturerReference;
    IF @NewId IS NULL
    BEGIN
        SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
        BEGIN TRANSACTION
            SELECT @NewId = [Id] FROM PDProductDetails
            WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
                  ManufacturerId = @ManufacturerNameId AND
                  ManufacturerReference = @ManufacturerReference;
            IF @NewId IS NULL
            BEGIN
                INSERT INTO PDProductDetails (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
                VALUES(@GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage);
                SELECT @NewId = SCOPE_IDENTITY();
            END
        COMMIT TRANSACTION
    END
    SELECT @NewId;
END
GO
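Multiple threads call this procedure to insert product details. However, with this version I ran into deadlocks very quickly. I then switched to another approach, using MERGE: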
CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;
        MERGE
        INTO [dbo].[PDProductDetails] T
        USING (SELECT @GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage)
            AS Source (GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference, PropertiesJson, OriginalContentPage)
        ON T.GDDataSourceVersionId = Source.GDDataSourceVersionId AND
           T.ManufacturerId = Source.ManufacturerNameId AND
           T.ManufacturerReference = Source.ManufacturerReference
        WHEN NOT MATCHED THEN
            INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
            VALUES(Source.GDDataSourceVersionId, Source.ManufacturerNameId,
                   Source.ManufacturerReference, Source.PropertiesJson, Source.OriginalContentPage);
    COMMIT TRANSACTION;
    SELECT @NewId = [Id] FROM PDProductDetails (NOLOCK)
    WHERE GDDataSourceVersionId = @GDDataSourceVersionId AND
          ManufacturerId = @ManufacturerNameId AND
          ManufacturerReference = @ManufacturerReference;
    SELECT @NewId;
END
GO
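This version always runs the MERGE first and selects the ID afterwards. It still deadlocks; not as quickly as the first one, but it still does.
How can I implement "insert ignore and return the inserted ID" in a way that does not deadlock in a concurrent environment?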
Answer 0 (score: 0)
After @ta.speot.is mentioned that you can use OUTPUT with MERGE, I searched for how to assign the result to a variable, and an answer I found described how.
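For context, the OUTPUT clause of a MERGE can return rows to the client or write them into a table or table variable, but it cannot assign to a scalar variable directly. The sketch below shows the table-variable form against the `PDProductDetails` table from the question; the literal values are made-up sample data, and it assumes the listed columns are the only required ones.

```sql
-- Table variable that receives the OUTPUT rows.
DECLARE @Result TABLE (MergeAction NVARCHAR(10), Id BIGINT);

MERGE INTO [dbo].[PDProductDetails] AS T
USING (SELECT 1, CAST(42 AS BIGINT), N'REF-001', N'{}', N'')  -- sample values only
    AS S (GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference,
          PropertiesJson, OriginalContentPage)
ON T.GDDataSourceVersionId = S.GDDataSourceVersionId AND
   T.ManufacturerId = S.ManufacturerNameId AND
   T.ManufacturerReference = S.ManufacturerReference
WHEN NOT MATCHED THEN
    INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference,
            PropertiesJson, OriginalContentPage)
    VALUES (S.GDDataSourceVersionId, S.ManufacturerNameId, S.ManufacturerReference,
            S.PropertiesJson, S.OriginalContentPage)
OUTPUT $action, inserted.Id INTO @Result (MergeAction, Id);

-- Contains one 'INSERT' row when the row was new, and no rows when it already existed.
SELECT MergeAction, Id FROM @Result;
```

OUTPUT only reports rows that the MERGE actually changed, so an already existing row yields nothing here. That gap is why a WHEN MATCHED branch is needed to pick up the ID of an existing row.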
I ended up using this stored procedure:
CREATE PROCEDURE [dbo].usp_insert_pdproductdetails
    @GDDataSourceVersionId INT,
    @ManufacturerNameId BIGINT,
    @ManufacturerReference NVARCHAR(255),
    @PropertiesJson NVARCHAR(MAX),
    @OriginalContentPage NVARCHAR(MAX),
    @NewId BIGINT OUT
AS
BEGIN
    SET NOCOUNT ON;
    -- OUT parameters also carry the caller's value in, so start from NULL;
    -- otherwise a stale value would win over SCOPE_IDENTITY() below.
    SET @NewId = NULL;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;
        MERGE
        INTO [dbo].[PDProductDetails] T
        USING (SELECT @GDDataSourceVersionId, @ManufacturerNameId, @ManufacturerReference, @PropertiesJson, @OriginalContentPage)
            AS Source (GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference, PropertiesJson, OriginalContentPage)
        ON T.GDDataSourceVersionId = Source.GDDataSourceVersionId AND
           T.ManufacturerId = Source.ManufacturerNameId AND
           T.ManufacturerReference = Source.ManufacturerReference
        -- Existing row: no data changes, only capture its Id.
        WHEN MATCHED THEN
            UPDATE SET @NewId = T.Id
        WHEN NOT MATCHED THEN
            INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference, PropertiesJson, OriginalContentPage)
            VALUES(Source.GDDataSourceVersionId, Source.ManufacturerNameId,
                   Source.ManufacturerReference, Source.PropertiesJson, Source.OriginalContentPage);
        -- New row: @NewId is still NULL, so take the identity just generated.
        SET @NewId = ISNULL(@NewId, SCOPE_IDENTITY());
    COMMIT TRANSACTION;
    SELECT @NewId;
END
GO
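As a usage sketch, the procedure can be called like this. The argument values are made-up sample data, and the call assumes any foreign key on `ManufacturerId` is satisfied by the sample ID.

```sql
DECLARE @Id BIGINT;  -- starts as NULL

EXEC dbo.usp_insert_pdproductdetails
    @GDDataSourceVersionId = 1,
    @ManufacturerNameId    = 42,
    @ManufacturerReference = N'REF-001',
    @PropertiesJson        = N'{"color":"red"}',
    @OriginalContentPage   = N'<html></html>',
    @NewId                 = @Id OUTPUT;

-- Holds the Id of the newly inserted row, or of the row that already existed.
SELECT @Id AS ProductDetailsId;
```

Because OUTPUT parameters in T-SQL are also inputs, it is safest to pass a variable that is NULL, for example by declaring a fresh one or resetting it between calls. Note that the procedure also returns the ID as a result set through its final `SELECT @NewId;`.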
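Edit: as @ta.speot.is mentioned, it would be better to send batched requests using table-valued parameters with the same approach (the MERGE would use the table input as its Source).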
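A batched variant along those lines might look like the sketch below. The type name `dbo.PDProductDetailsInput` and the procedure name `usp_insert_pdproductdetails_batch` are hypothetical, and the sketch has not been verified against the deadlock scenario from the question. It uses the HOLDLOCK hint on the MERGE target, which gives that statement serializable semantics, then joins back to return an ID for every input key.

```sql
-- Hypothetical table type; columns mirror the procedure parameters above.
CREATE TYPE dbo.PDProductDetailsInput AS TABLE
(
    GDDataSourceVersionId INT           NOT NULL,
    ManufacturerNameId    BIGINT        NOT NULL,
    ManufacturerReference NVARCHAR(255) NOT NULL,
    PropertiesJson        NVARCHAR(MAX) NULL,
    OriginalContentPage   NVARCHAR(MAX) NULL
);
GO

CREATE PROCEDURE dbo.usp_insert_pdproductdetails_batch  -- hypothetical name
    @Rows dbo.PDProductDetailsInput READONLY
AS
BEGIN
    SET NOCOUNT ON;

    MERGE INTO [dbo].[PDProductDetails] WITH (HOLDLOCK) AS T
    USING (
        -- Keep one row per key so in-batch duplicates are not inserted twice.
        SELECT GDDataSourceVersionId, ManufacturerNameId, ManufacturerReference,
               PropertiesJson, OriginalContentPage
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY GDDataSourceVersionId, ManufacturerNameId,
                                    ManufacturerReference
                       ORDER BY (SELECT NULL)) AS rn
            FROM @Rows
        ) AS Numbered
        WHERE rn = 1
    ) AS S
    ON T.GDDataSourceVersionId = S.GDDataSourceVersionId AND
       T.ManufacturerId = S.ManufacturerNameId AND
       T.ManufacturerReference = S.ManufacturerReference
    WHEN NOT MATCHED THEN
        INSERT (GDDataSourceVersionId, ManufacturerId, ManufacturerReference,
                PropertiesJson, OriginalContentPage)
        VALUES (S.GDDataSourceVersionId, S.ManufacturerNameId, S.ManufacturerReference,
                S.PropertiesJson, S.OriginalContentPage);

    -- Return the Id for every distinct input key, whether new or pre-existing.
    SELECT DISTINCT T.Id, R.GDDataSourceVersionId, R.ManufacturerNameId,
           R.ManufacturerReference
    FROM @Rows AS R
    JOIN [dbo].[PDProductDetails] AS T
        ON T.GDDataSourceVersionId = R.GDDataSourceVersionId AND
           T.ManufacturerId = R.ManufacturerNameId AND
           T.ManufacturerReference = R.ManufacturerReference;
END
GO
```

A caller would fill a variable of the table type and pass it in, for example `DECLARE @Batch dbo.PDProductDetailsInput;` followed by an `INSERT INTO @Batch ...` and `EXEC dbo.usp_insert_pdproductdetails_batch @Rows = @Batch;`. One round trip per batch instead of one per row also means far fewer competing transactions, which is the main appeal of this design.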