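I have a table whose data I need to randomize. By randomize, I mean taking the data from a random row and using it to update another row in the same column. The problem is that the table itself is large (over 2,000,000 rows).
I wrote a piece of code that uses a WHILE loop, but it is slow.
Does anyone have suggestions for a more efficient way to do this randomization?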
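Answer 0 (score: 4)
Updating that many rows will take a significant amount of processing time (CPU + I/O).
Have you measured the relative cost of randomizing the rows versus performing the update?
If all you need is to pick random rows, here is an efficient way to select a random sample of rows (in this case, 1% of the rows):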
SELECT * FROM myTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
where pkID is your primary key column.
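To see what that predicate is doing, here is a minimal Python sketch of the same idea. It is not T-SQL: the function name `sample_rows`, the `seed` parameter, and the use of `getrandbits` as a stand-in for `CHECKSUM(NEWID(), pkID)` are assumptions for illustration only. Each row gets its own random 31-bit value scaled to the range 0 to 1, and the row is kept when that value is at most 0.01.

```python
import random

MASK = 0x7FFFFFFF  # keeps the low 31 bits, so the value is never negative


def sample_rows(primary_keys, fraction=0.01, seed=None):
    """Keep each row independently with probability of about `fraction`."""
    rng = random.Random(seed)
    picked = []
    for pk in primary_keys:
        # Stand-in for CHECKSUM(NEWID(), pkID): a fresh random 32-bit integer per row.
        checksum = rng.getrandbits(32)
        score = (checksum & MASK) / MASK  # scaled into [0, 1]
        if fraction >= score:
            picked.append(pk)
    return picked


picked = sample_rows(range(200_000), fraction=0.01, seed=1)
print(len(picked))  # close to 2,000, but not exactly 2,000
```

Because every row is tested on its own, the sample size is approximately 1% rather than exactly 1%, and no sort of the whole table is needed.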
This article may be of interest:
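Answer 1 (score: 2)
Shuffling the data in 10 columns, so that the 10 values in each row are replaced by values from other rows, will be expensive.
You have to read the 2 million rows 10 times.
The SELECT would be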
SELECT
FirstName, LastName, VIN, ...
FROM
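-- number each shuffled column so the rows pair up one-to-one
(SELECT FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) FirstName
JOIN
(SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) LastName ON LastName.rn = FirstName.rn
JOIN
(SELECT VIN, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) VIN ON VIN.rn = FirstName.rn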
JOIN
...
I also wouldn't update in place; I'd create a new table
SELECT
FirstName, LastName, VIN, ...
INTO
StagingTable
FROM
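-- number each shuffled column so the rows pair up one-to-one
(SELECT FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) FirstName
JOIN
(SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) LastName ON LastName.rn = FirstName.rn
JOIN
(SELECT VIN, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) VIN ON VIN.rn = FirstName.rn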
JOIN
...
Then add the keys and so on, drop the old table, and rename the new one. Or use a SYNONYM that points to the new table.
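The build-a-copy-then-swap idea can be sketched end to end. This uses Python's built-in `sqlite3` module as a stand-in for SQL Server, so the DDL is not T-SQL (on SQL Server the rename would typically go through `sp_rename`, or a synonym would be repointed). The function name `swap_in_shuffled_copy` and the two-column `MyTable` layout are assumptions made for the example.

```python
import random
import sqlite3


def swap_in_shuffled_copy(conn, seed=None):
    """Build StagingTable with independently shuffled columns, then swap it in."""
    rng = random.Random(seed)
    # Read each column once and shuffle it on its own.
    first = [r[0] for r in conn.execute("SELECT FirstName FROM MyTable")]
    last = [r[0] for r in conn.execute("SELECT LastName FROM MyTable")]
    rng.shuffle(first)
    rng.shuffle(last)

    # Fill a new table instead of updating the old one in place.
    conn.execute("CREATE TABLE StagingTable (FirstName TEXT, LastName TEXT)")
    conn.executemany("INSERT INTO StagingTable VALUES (?, ?)", zip(first, last))
    # Swap: drop the original and give the staging table its name.
    conn.execute("DROP TABLE MyTable")
    conn.execute("ALTER TABLE StagingTable RENAME TO MyTable")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [("Ann", "Lee"), ("Bob", "Kim"), ("Cy", "Diaz"), ("Di", "Roy")],
)
swap_in_shuffled_copy(conn, seed=42)
rows = conn.execute("SELECT FirstName, LastName FROM MyTable").fetchall()
```

The point of the design is that the original table is only read, never updated row by row; all the writing goes into a fresh table that replaces it at the end.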
If you do want to update, this is how I'd do it. Or break it up into 10 separate updates.
UPDATE
M
SET
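FirstName = FirstName.FirstName,
LastName = LastName.LastName,
...
FROM
-- number the target rows and each shuffled column so they pair up one-to-one
(SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn FROM MyTable) M
JOIN
(SELECT FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) FirstName ON FirstName.rn = M.rn
JOIN
(SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) LastName ON LastName.rn = M.rn
JOIN
(SELECT VIN, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) VIN ON VIN.rn = M.rn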
JOIN
...
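Answer 2 (score: 2)
Building on Mitch Wheat's answer, which links to this article on scrambling data, you can do something like this to scramble a whole set of fields; you're not limited to just the ID: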
;WITH Randomize AS
(
SELECT ROW_NUMBER() OVER (ORDER BY [UserID]) AS orig_rownum,
ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum,
*
FROM [UserTable]
)
UPDATE T1
SET [UserID] = T2.[UserID]
,[FirstName] = T2.[FirstName]
,[LastName] = T2.[LastName]
,[AddressLine1] = T2.[AddressLine1]
,[AddressLine2] = T2.[AddressLine2]
,[AddressLine3] = T2.[AddressLine3]
,[City] = T2.[City]
,[State] = T2.[State]
,[Pincode] = T2.[Pincode]
,[PhoneNumber] = T2.[PhoneNumber]
,[MobileNumber] = T2.[MobileNumber]
,[Email] = T2.[Email]
,[Status] = T2.[Status]
FROM Randomize T1
join Randomize T2 on T1.orig_rownum = T2.new_rownum
;
So you're not limited to doing only this, as shown in the article:
;WITH Randomize AS
(
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS orig_rownum,
ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum,
*
FROM [MyTable]
)
UPDATE T1 SET Id = T2.Id
FROM Randomize T1
join Randomize T2 on T1.orig_rownum = T2.new_rownum
;
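The danger with this approach is the amount of data you're shuffling. The CTE version works on the whole table in a single statement; I found it fairly quick (19 seconds for a 500k-row table), but you'll want to be careful if you have a table with millions of records. You should consider how much data you actually need, or what would be a good population sample, for testing and development.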
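In plain terms, the self-join pairs each row's original position with another row's random position, so whole rows trade places. A minimal Python sketch of that pairing, assuming the table fits in a list (the function name `permute_rows` is made up for illustration and is not part of the article's code):

```python
import random


def permute_rows(rows, seed=None):
    """Return the rows in a random order, as a one-to-one reassignment."""
    rng = random.Random(seed)
    n = len(rows)
    # new_rownum[j] plays the role of ROW_NUMBER() OVER (ORDER BY NewId())
    # for the row whose original row number is j.
    new_rownum = list(range(n))
    rng.shuffle(new_rownum)

    result = [None] * n
    for j, target in enumerate(new_rownum):
        # The row at original position `target` receives row j's values,
        # mirroring the join T1.orig_rownum = T2.new_rownum.
        result[target] = rows[j]
    return result


users = [(i, f"user{i}") for i in range(10)]
shuffled = permute_rows(users, seed=3)
```

Every source row is used exactly once, so the result is a permutation of the table: no values are lost or duplicated, they just end up on different rows.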
Answer 3 (score: 1)
I combined the answers above into a single query that randomizes each column separately, so you end up with fully randomized records:
UPDATE MyTable SET
columnA = columnA.newValue,
columnB = columnB.newValue,
-- Optionally, for maintaining a group of values like street, zip, city in an address
columnC = columnGroup.columnC,
columnD = columnGroup.columnD,
columnE = columnGroup.columnE
FROM MyTable
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id FROM MyTable
) AS PKrows ON MyTable.id = PKrows.id
-- repeat the following JOIN for each column you want to randomize
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnA AS newValue FROM MyTable
) AS columnA ON PKrows.rn = columnA.rn
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnB AS newValue FROM MyTable
) AS columnB ON PKrows.rn = columnB.rn
-- Optionally, if you want to maintain a group of values spread out over several columns
INNER JOIN (
SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnC, columnD, columnE FROM MyTable
) AS columnGroup ON PKrows.rn = columnGroup.rn
This query took 8 seconds on a 10K-row table, shuffling 8 columns, on a Windows 2008 R2 machine with 16 GB of RAM and 4 Xeon cores @ 2.93 GHz.
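The pattern of this query, sketched in Python for illustration only: the key stays where it is, each standalone column gets its own shuffle, and a column group is shuffled as one unit so its values stay together. The function `shuffle_groups` and the sample column names are assumptions, not part of the query above.

```python
import random


def shuffle_groups(rows, groups, seed=None):
    """Shuffle each group of columns independently across the rows.

    `rows` is a list of dicts; `groups` is a list of column-name tuples.
    Columns that appear in no group (for example the key) are left untouched.
    """
    rng = random.Random(seed)
    out = [dict(row) for row in rows]  # copy, so the input is not modified
    for group in groups:
        # Collect the values of this group from every row, keeping them together.
        picks = [tuple(row[col] for col in group) for row in rows]
        rng.shuffle(picks)
        # Hand the shuffled value sets back out, one per row, by position.
        for target, values in zip(out, picks):
            target.update(zip(group, values))
    return out


people = [
    {"id": i, "name": f"n{i}", "city": f"c{i}", "zip": f"z{i}"} for i in range(20)
]
# "name" moves on its own; "city" and "zip" move as a pair.
scrambled = shuffle_groups(people, [("name",), ("city", "zip")], seed=7)
```

Grouping matters when columns only make sense together, such as street, zip, and city: shuffling them separately would produce addresses that do not exist.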