Question

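I have a table with more than 1,000,000 records. I need to replace any names in a text field with pseudonyms to help de-identify the data. For this example, assume the table is TemporaryTest and has two fields: Id (the key field) and IndexedXML (a text field).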

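I have a second table, AppellationSubstitution, with the following columns: TextEntry (the name to be replaced), Length (the length of TextEntry), and Replacement (the substitute name, which may have a different length). This table has about 110,000 rows.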
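For reference, the two tables could be sketched roughly as follows. The column names come from the description above; the data types, nullability, and the primary key declaration are assumptions made for illustration, not the actual production schema.

```sql
-- Hypothetical schema sketch; all data types below are assumptions.
CREATE TABLE dbo.TemporaryTest (
    Id         int           NOT NULL PRIMARY KEY, -- key field
    IndexedXML nvarchar(max) NULL                  -- text field holding the names
);

CREATE TABLE dbo.AppellationSubstitution (
    TextEntry   nvarchar(255) NOT NULL, -- name to be replaced
    [Length]    int           NOT NULL, -- length of TextEntry
    Replacement nvarchar(255) NOT NULL  -- substitute name, may differ in length
);
```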

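The first step I use is the following (the regular expression matches words in the text field; it looks a bit odd because of some strange characters that show up in this database):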

SELECT id, 
       matchindex, 
       matchlength, 
       replacement  
FROM   TemporaryTest 
       CROSS APPLY
master.dbo.Regexmatches('([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?', [IndexedXML], 
master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0)) 
       INNER JOIN dbo.appellationsubstitution 
       ON match = textentry
       ORDER BY Id, MatchIndex DESC;--if replace in forward order, insertion point gets moved

This produces a result set with more than 100,000 rows. A few of them are shown below:

Id      matchindex  matchlength  replacement
99309   122         5            “Demarcus”
108639  106         5            “Demarcus”
109809  84          6            “Rehbein”
110373  89          7            “Reginald”
111156  105         5            “Demarcus”
112452  129         6            “Thie”
112896  113         6            “Diberardino”
112896  92          6            “Diberardino”
113503  119         3            “Rubin”

The full procedure I am currently trying is:

SET NOCOUNT ON;
SET XACT_ABORT ON;
BEGIN TRANSACTION;

DECLARE ReplaceCursor CURSOR LOCAL FOR
SELECT id, 
       matchindex, 
       matchlength, 
       replacement
FROM   TemporaryTest 
       CROSS APPLY
master.dbo.Regexmatches('([Xx]-)?[\w-[0-9üÿ_]]{2,}(-[\w-[0-9üÿ_]]{2,})?(''[\w-[0-9üÿ_]])?', [IndexedXML], 
master.dbo.Regexoptionenumeration(0, 0, 1, 1, 0, 0, 0, 0, 0)) 
       INNER JOIN dbo.appellationsubstitution 
       ON match = textentry
       ORDER BY Id, MatchIndex DESC;--if replace in forward order, insertion point gets moved 
DECLARE @Rid int, @Rmi AS int, @Rml AS int, @Rrep AS nvarchar(255);
OPEN ReplaceCursor;
FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- STUFF positions are 1-based, hence the +1 on the match index
    UPDATE TemporaryTest
    SET IndexedXML = STUFF([IndexedXML], @Rmi + 1, @Rml, @Rrep)
    WHERE Id = @Rid;
    FETCH NEXT FROM ReplaceCursor INTO @Rid, @Rmi, @Rml, @Rrep;
END;
CLOSE ReplaceCursor;
DEALLOCATE ReplaceCursor;
COMMIT TRANSACTION;

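This works, but it takes a very long time to run (more than an hour and it still has not finished), and IndexedXML is one of the smallest text fields in my production database.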

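I am using a cursor because I do not know of any other way to manage sequential STUFF calls on the same cell, where each subsequent STUFF call operates on the result of the previous one.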
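To illustrate why the matches are processed in descending MatchIndex order, here is a minimal sketch on a single made-up string. The sample text and its two match positions are hypothetical; only STUFF and the replacement names from the sample rows above are taken from the question.

```sql
-- Hypothetical sample value; 'John' starts at 0-based index 6, 'Mary' at 15.
DECLARE @s nvarchar(100) = N'Hello John and Mary';

-- Rightmost match first: replacing 'Mary' does not move 'John'.
SET @s = STUFF(@s, 15 + 1, 4, N'Diberardino');

-- The earlier match still sits at its original index.
SET @s = STUFF(@s, 6 + 1, 4, N'Rubin');

SELECT @s AS Result; -- 'Hello Rubin and Diberardino'
```

If the replacements were applied left to right instead, 'Rubin' (5 characters) replacing 'John' (4 characters) would shift 'Mary' one position to the right, so the stored index 15 would no longer point at the start of the name.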

Am I on the right course, or is there a faster and cleaner way to accomplish this?

Repeated manipulations on the same cell: can this cursor procedure be optimized or replaced with a query?

0 answers: