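Efficiently randomizing (shuffling) data in a SQL Server table

Time: 2011-08-02 10:24:33

Tags: sql sql-server random

I have a table containing data that I need to randomize. By randomizing, I mean taking the data from a random row and using it to update another row in the same column. The problem is that the table itself is large (more than 2,000,000 rows).

I wrote a piece of code that uses a WHILE loop, but it is slow.

Does anyone have suggestions for a more efficient way to do the randomization?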

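4 answers:

Answer 0 (score: 4)

The updates themselves will account for a lot of the processing time (CPU + I/O).

Have you measured the relative cost of randomizing the rows versus performing the update?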
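One way to take that measurement is sketched below. It assumes a table `MyTable` with a key column `pkID` and a column `SomeColumn` to shuffle; all three names are placeholders, not names from the question. The timings and I/O counts appear in the Messages output for each statement.

```sql
-- MyTable, pkID and SomeColumn are hypothetical names; substitute your own.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Step 1: cost of producing the random order (materialized into a temp table).
SELECT SomeColumn,
       ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
INTO #Shuffled
FROM MyTable;

-- Step 2: cost of the write, run in a transaction that is rolled back
-- so the measurement leaves the data unchanged.
BEGIN TRANSACTION;

WITH Target AS (
    SELECT SomeColumn,
           ROW_NUMBER() OVER (ORDER BY pkID) AS rn
    FROM MyTable
)
UPDATE T
SET SomeColumn = S.SomeColumn
FROM Target T
JOIN #Shuffled S ON S.rn = T.rn;

ROLLBACK TRANSACTION;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
DROP TABLE #Shuffled;
```

Comparing the elapsed time and logical reads reported for the two steps gives a rough idea of which part dominates.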

All you need to do is select random rows. Here is an efficient way to select a random sample of rows (in this case, 1% of the rows):

SELECT * FROM myTable
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

where pkID is your primary key column.
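To see what the filter expression is doing, the sketch below splits it into its pieces on a small generated table. The table variable `@demo` and its 1000 rows are stand-ins invented for illustration, and the query assumes `sys.all_objects` holds at least 1000 rows.

```sql
-- @demo and its generated rows are hypothetical stand-ins for myTable.
DECLARE @demo TABLE (pkID int PRIMARY KEY);

INSERT INTO @demo (pkID)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects;   -- assumed to contain at least 1000 rows

-- Each NEWID() call is evaluated separately, so the three columns
-- illustrate the three stages but come from different random values.
SELECT pkID,
       CHECKSUM(NEWID(), pkID)              AS raw_checksum,  -- any int, may be negative
       CHECKSUM(NEWID(), pkID) & 0x7fffffff AS non_negative,  -- sign bit cleared
       CAST(CHECKSUM(NEWID(), pkID) & 0x7fffffff AS float)
           / CAST(0x7fffffff AS int)        AS fraction       -- value between 0 and 1
FROM @demo;
```

Because `fraction` is spread roughly evenly between 0 and 1, keeping the rows where it is at most 0.01 returns about 1% of the table, with the exact count varying from run to run.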

This article may be of interest:

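Answer 1 (score: 2)

Shuffling the data in 10 columns, so that each row's 10 values are replaced by values taken from other rows, is going to be expensive.

You have to read the 2 million rows 10 times.

The SELECT would be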

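SELECT
    FirstName, LastName, VIN, ...
FROM
    -- ORDER BY is not allowed in a derived table without TOP, and ON 1=1 would
    -- cross join, so each column gets its own random row number to join on
    (SELECT FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) FirstName
    JOIN
    (SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) LastName ON LastName.rn = FirstName.rn
    JOIN
    (SELECT VIN, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) VIN ON VIN.rn = FirstName.rn
    JOIN
    ...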

I also wouldn't update; I would create a new table

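SELECT
    FirstName, LastName, VIN, ...
INTO
    StagingTable
FROM
    -- each column gets its own random row number, and the pieces are joined on it
    (SELECT FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) FirstName
    JOIN
    (SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) LastName ON LastName.rn = FirstName.rn
    JOIN
    (SELECT VIN, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) VIN ON VIN.rn = FirstName.rn
    JOIN
    ...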

Then add keys and so on, drop the old table, and rename the new one. Or use a SYNONYM to point to the new table.
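A minimal sketch of that swap follows, assuming `StagingTable` has already been filled by the SELECT ... INTO above and that nothing else (foreign keys, views, and so on) depends on the old table. The `Id` column, the constraint name, and the synonym name are hypothetical.

```sql
-- Assumes dbo.StagingTable already holds the shuffled rows.
-- Id and PK_StagingTable are hypothetical names chosen for this sketch.
ALTER TABLE dbo.StagingTable
    ADD Id int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_StagingTable PRIMARY KEY;

-- Swap the tables in one transaction so readers never see a missing table.
BEGIN TRANSACTION;
    DROP TABLE dbo.MyTable;
    EXEC sp_rename 'dbo.StagingTable', 'MyTable';
COMMIT TRANSACTION;

-- Alternative to dropping and renaming: keep both tables and let callers
-- go through a synonym (MyTableShuffled is a hypothetical name).
-- CREATE SYNONYM dbo.MyTableShuffled FOR dbo.StagingTable;
```

The DROP TABLE will fail if other tables reference `MyTable` through foreign keys, so those would need to be dropped and recreated around the swap.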

If you do want to update, this is how I would do it. Or break it up into 10 updates.

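UPDATE
   M
SET
   FirstName = FirstName.FirstName,
   LastName = LastName.LastName,
   ...
FROM
    -- number the target rows once; each source column gets its own random numbering
    (SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS rn FROM MyTable) M
    JOIN 
    (SELECT FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) FirstName ON FirstName.rn = M.rn
    JOIN
    (SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) LastName ON LastName.rn = M.rn
    JOIN
    (SELECT VIN, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM MyTable) VIN ON VIN.rn = M.rn
    JOIN
    ...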
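Breaking the work into one update per column could look like the sketch below. It runs against a small temp table, `#People`, invented here as a stand-in for `MyTable`, and it pairs rows by row number, which is one way (not necessarily the author's) to match each target row with exactly one source row. The multi-row VALUES list needs SQL Server 2008 or later.

```sql
-- Hypothetical demo data; #People stands in for MyTable.
CREATE TABLE #People (
    Id        int PRIMARY KEY,
    FirstName varchar(20),
    LastName  varchar(20)
);

INSERT INTO #People (Id, FirstName, LastName)
VALUES (1, 'Ann', 'Adams'), (2, 'Bob', 'Brown'),
       (3, 'Cid', 'Clark'), (4, 'Dee', 'Dunn');

-- Update 1: shuffle FirstName only.
WITH Target AS (
    SELECT FirstName, ROW_NUMBER() OVER (ORDER BY Id) AS rn FROM #People
),
Source AS (
    SELECT FirstName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM #People
)
UPDATE T
SET FirstName = S.FirstName
FROM Target T
JOIN Source S ON S.rn = T.rn;

-- Update 2: shuffle LastName independently of FirstName.
WITH Target AS (
    SELECT LastName, ROW_NUMBER() OVER (ORDER BY Id) AS rn FROM #People
),
Source AS (
    SELECT LastName, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn FROM #People
)
UPDATE T
SET LastName = S.LastName
FROM Target T
JOIN Source S ON S.rn = T.rn;

-- Every original value is still present exactly once per column,
-- but first and last names are no longer paired as before.
SELECT Id, FirstName, LastName FROM #People ORDER BY Id;

DROP TABLE #People;
```

Each statement touches a single column, which keeps every transaction smaller at the price of scanning the table once more per column.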

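Answer 2 (score: 2)

Building on Mitch Wheat's answer, which links to this article on scrambling data, you can do something like this to scramble a whole set of fields; you are not limited to just the ID: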

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY [UserID]) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [UserTable]
) 
UPDATE T1 
   SET [UserID] = T2.[UserID]
      ,[FirstName] = T2.[FirstName]
      ,[LastName] = T2.[LastName]
      ,[AddressLine1] =  T2.[AddressLine1]
      ,[AddressLine2] =  T2.[AddressLine2]
      ,[AddressLine3] =  T2.[AddressLine3]
      ,[City] = T2.[City]
      ,[State] = T2.[State]
      ,[Pincode] = T2.[Pincode]
      ,[PhoneNumber] = T2.[PhoneNumber]
      ,[MobileNumber] = T2.[MobileNumber]
      ,[Email] = T2.[Email]
      ,[Status] = T2.[Status] 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

So you are not limited to doing only this, as shown in the article:

;WITH Randomize AS 
( 
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS orig_rownum, 
      ROW_NUMBER() OVER (ORDER BY NewId()) AS new_rownum, 
      * 
FROM [MyTable]
) 
UPDATE T1 SET Id = T2.Id 
FROM Randomize T1 
      join Randomize T2 on T1.orig_rownum = T2.new_rownum 
;

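The danger with this approach is the amount of data you are changing. Using a CTE stuffs everything into memory, so although I found this to be fairly quick (19 seconds for a 500k-row table), you will want to be careful if you have a table with millions of records. You should consider how much data you actually need, or what makes a good population sample, for testing and development.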
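If a sample is enough for testing and development, one option is to copy a random subset into its own table first and scramble only that copy. The sketch below assumes a 10 percent sample is adequate; the percentage and the table name `UserTable_Sample` are arbitrary choices made for illustration.

```sql
-- UserTable_Sample is a hypothetical name; 10 percent is an arbitrary choice.
-- ORDER BY NEWID() still sorts the full table once to pick the sample.
SELECT TOP (10) PERCENT *
INTO dbo.UserTable_Sample
FROM dbo.[UserTable]
ORDER BY NEWID();
```

The scrambling UPDATE can then be pointed at `UserTable_Sample`, so it only has to process a tenth of the rows.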

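Answer 3 (score: 1)

I combined the answers above into a single query that randomizes each column separately, so you end up with completely randomized records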

UPDATE MyTable SET
  columnA = columnA.newValue,
  columnB = columnB.newValue,
  -- Optionally, for maintaining a group of values like street, zip, city in an address
  columnC = columnGroup.columnC,
  columnD = columnGroup.columnD,
  columnE = columnGroup.columnE
FROM MyTable
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY id) AS rn, id FROM MyTable
) AS PKrows ON MyTable.id = PKrows.id
-- repeat the following JOIN for each column you want to randomize
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnA AS newValue FROM MyTable
) AS columnA ON PKrows.rn = columnA.rn
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnB AS newValue FROM MyTable
) AS columnB ON PKrows.rn = columnB.rn

-- Optionally, if you want to maintain a group of values spread out over several columns
INNER JOIN (
  SELECT ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn, columnC, columnD, columnE FROM MyTable
) AS columnGroup ON PKrows.rn = columnGroup.rn

This query took 8 seconds on a 10K-row table, shuffling 8 columns, on a Windows 2008 R2 machine with 16 GB of RAM and 4 Xeon cores @ 2.93 GHz