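I want to generate a random value from a uniform distribution with mean = 0 and standard deviation = 1 for every row of a given table in T-SQL. I also want to set a seed so that the analysis is reproducible. Here are the ideas that did not work:
Using the RAND() function with a declared seed does not meet this goal: it generates the same random value for every row of the dataset.
A solution like this: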
SELECT ABS(CAST(CAST(NEWID() AS VARBINARY) AS INT)) AS [RandomNumber]
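does not solve the problem either, because it is not reproducible.
Edit
Since my table has hundreds of millions of records, performance matters.
Answer 0 (score: 4)
The RAND() function can be seeded at the start by passing it an integer seed value. If you do this once, before generating any random numbers, the sequence of random numbers will be repeatable. Generating the values one at a time ensures that RAND() returns the numbers in sequence. The following produces n pseudo-random numbers from a uniform distribution with mean = 0 and standard deviation = 1: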
DECLARE @Mean FLOAT = 0.0;
DECLARE @stDev FLOAT = 1.0;
DECLARE @n INT = 100000; -- count of random numbers to generate
DECLARE @U TABLE(x FLOAT); -- table of random numbers
DECLARE @SEED INT = 123456; -- seed to ensure list is reproducible
SELECT RAND(@Seed);
SET NOCOUNT ON;
BEGIN TRAN
DECLARE @x INT = 0; -- counter
WHILE @x < @n
BEGIN
INSERT INTO @U (x)
SELECT @Mean + (2 * SQRT(3) * @stDev) * (RAND() - 0.5)
SET @x = @x + 1;
END;
COMMIT
-- Check the results
SELECT * from @U;
SELECT AVG([@U].x) AS mean,
STDEV([@U].x) AS stDev
FROM @U;
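The factor 2 * SQRT(3) in the expression above follows from the variance of a uniform distribution. A short derivation, writing U for the value returned by RAND() and assuming it is uniform on [0, 1), with \mu and \sigma standing for @Mean and @stDev:

```latex
\begin{align*}
U &\sim \mathrm{Uniform}(0,1), \qquad \mathbb{E}[U] = \tfrac{1}{2}, \qquad \operatorname{Var}(U) = \tfrac{1}{12} \\
X &= \mu + 2\sqrt{3}\,\sigma\,\bigl(U - \tfrac{1}{2}\bigr) \\
\mathbb{E}[X] &= \mu + 2\sqrt{3}\,\sigma\,\bigl(\mathbb{E}[U] - \tfrac{1}{2}\bigr) = \mu \\
\operatorname{Var}(X) &= \bigl(2\sqrt{3}\,\sigma\bigr)^{2}\,\operatorname{Var}(U) = 12\,\sigma^{2}\cdot\tfrac{1}{12} = \sigma^{2}
\end{align*}
```

With @Mean = 0 and @stDev = 1 the generated values therefore fall in [-SQRT(3), SQRT(3)), roughly [-1.732, 1.732).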
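Instead of inserting into a table variable in a WHILE loop, you can use a cursor to walk through the records of an existing table and perform an update for each record. As mentioned in the comments, performance may be a problem, but this meets the requirements "uniform distribution with mean = 0 and standard deviation = 1" and "reproducibility". The way RAND() works forces the one-by-one update.
Below is an alternative that performs better (it should run in under 2 seconds for 1 million rows) by replacing the RAND() function. This lets the records be updated in a single UPDATE, but it relies on a unique numeric ID column in the table and updates a column named RandomNumber. RAND() is replaced by ( (ID * @SEED ) % 1000 ) / 1000, which could be improved.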
DECLARE @Mean FLOAT = 0.0;
DECLARE @stDev FLOAT = 1.0;
DECLARE @SEED numeric(18,0) = 1234567890.0; -- seed to ensure list is reproducible
SET NOCOUNT ON;
BEGIN TRAN
UPDATE TestTable
set Randomnumber = @Mean + (2 * SQRT(3) * @stDev) * (( (ID * @SEED ) % 1000 ) / 1000 - 0.5)
COMMIT
-- Check the results
SELECT AVG(RandomNumber) AS mean,
STDEV(RandomNumber ) AS stDev
FROM TestTable;
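One possible improvement to the ID-based expression, offered as a sketch rather than a tested solution: with the seed shown above, ID * @SEED is always a multiple of 10, so ( (ID * @SEED ) % 1000 ) / 1000 can take only 100 distinct values. Hashing the ID together with the seed spreads the values far more finely while staying reproducible. The example reuses TestTable, ID and RandomNumber from the code above. The use of HASHBYTES with SHA2_256, the CONCAT call (both need SQL Server 2012 or later) and the scaling by 2^63 are my assumptions, not part of the original answer, and hashing hundreds of millions of rows will likely be slower than the modulo version.

```sql
DECLARE @Mean  FLOAT = 0.0;
DECLARE @stDev FLOAT = 1.0;
DECLARE @Seed  INT   = 123456;  -- changing the seed changes every value

-- HASHBYTES returns 32 bytes for SHA2_256; the cast to BIGINT keeps 8 of them.
-- Dividing by 2^63 maps the signed 64-bit integer into roughly [-1, 1),
-- and a uniform variable on [-1, 1) has standard deviation 1 / SQRT(3),
-- so multiplying by SQRT(3) * @stDev gives the requested spread.
UPDATE TestTable
SET RandomNumber = @Mean + SQRT(3.0) * @stDev
    * (CAST(CAST(HASHBYTES('SHA2_256', CONCAT(ID, ':', @Seed)) AS BIGINT) AS FLOAT)
       / 9223372036854775808.0);

-- Check the results
SELECT AVG(RandomNumber) AS mean,
       STDEV(RandomNumber) AS stDev
FROM TestTable;
```

Because the value depends only on ID and @Seed, rerunning the UPDATE with the same seed should reproduce the same number for each row regardless of row order.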
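Answer 1 (score: 4)
IMHO the main question is what you mean by 'repeatability'. Or, put differently: what "drives" the randomness? I can imagine a solution that assigns the same random number to every record on every run, as long as the data does not change. But what do you want to happen when the data does change?
For fun, I ran the following tests on a (not very representative) test table with 1 million rows: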
-- seed
SELECT Rand(0)
-- will show the same random number for EVERY record
SELECT Number, blah = Convert(varchar(100), NewID()), random = Rand()
INTO #test
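-- NOTE: fn_int_list is not a built-in function; presumably the author's own helper returning the integers 1..1000000
FROM master.dbo.fn_int_list(1, 1000000)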
CREATE UNIQUE CLUSTERED INDEX uq0_test ON #test (Number)
SET NOCOUNT ON
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP,
@c_number int
-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !
-- seed
SELECT @c_number = Rand(0)
-- update 1 by 1
DECLARE cursor_no_transaction CURSOR LOCAL STATIC
FOR SELECT Number
FROM #test
ORDER BY Number
OPEN cursor_no_transaction
FETCH NEXT FROM cursor_no_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE #test
SET random = Rand()
WHERE Number = @c_number
FETCH NEXT FROM cursor_no_transaction INTO @c_number
END
CLOSE cursor_no_transaction
DEALLOCATE cursor_no_transaction
PRINT 'Time needed (no transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP,
@c_number int
BEGIN TRANSACTION
-- update each record (one by one) and set the random number based on 'the next Rand()' value
-- => the order of the records drives the distribution of the Rand() value !
-- seed
SELECT @c_number = Rand(0)
-- update 1 by 1 but all of it inside 1 single transaction
DECLARE cursor_single_transaction CURSOR LOCAL STATIC
FOR SELECT Number
FROM #test
ORDER BY Number
OPEN cursor_single_transaction
FETCH NEXT FROM cursor_single_transaction INTO @c_number
WHILE @@FETCH_STATUS = 0
BEGIN
UPDATE #test
SET random = Rand()
WHERE Number = @c_number
FETCH NEXT FROM cursor_single_transaction INTO @c_number
END
CLOSE cursor_single_transaction
DEALLOCATE cursor_single_transaction
COMMIT TRANSACTION
PRINT 'Time needed (single transaction) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP
-- update each record (single operation), use the Number column to reseed the Rand() function for every record
UPDATE #test
SET random = Rand(Number)
PRINT 'Time needed Rand(Number) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
GO
DECLARE @start_time datetime = CURRENT_TIMESTAMP
-- update each record (single operation), use 'a bunch of fields' to reseed the Rand() function for every record
UPDATE #test
SET random = Rand(BINARY_CHECKSUM(Number, blah))
PRINT 'Time needed Rand(BINARY_CHECKSUM(Number, blah)) : ' + Convert(nvarchar(100), DateDiff(ms, @start_time, CURRENT_TIMESTAMP)) + ' ms.'
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test
The results are more or less as expected:
Time needed (no transaction) : 24570 ms.
_avg _stdev
---------------------- ----------------------
0.499630943538644 0.288686960086461
Time needed (single transaction) : 14813 ms.
_avg _stdev
---------------------- ----------------------
0.499630943538646 0.288686960086461
Time needed Rand(Number) : 1203 ms.
_avg _stdev
---------------------- ----------------------
0.499407423620328 0.291093824839539
Time needed Rand(BINARY_CHECKSUM(Number, blah)) : 1250 ms.
_avg _stdev
---------------------- ----------------------
0.499715398881586 0.288579510523627
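All of these are 'repeatable'; the question is whether 'repeatable' here means what you want it to mean. I stuck to AVG() and STDEV() to get a rough idea of the distribution; I'll leave it to you to check whether the values actually fit the requirement (and, if not, how to improve on that =)
1.2 seconds for 1 million rows is not too bad, IMHO. That said, if your table contains extra columns it will take up more space and therefore need more time!
Hope this gets you started...
Answer 2 (score: 1)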
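The averages near 0.5 and standard deviations near 0.2887 (about 1 / SQRT(12)) in the output above are what a uniform distribution on [0, 1) should give, so the values still need shifting and scaling to reach mean 0 and standard deviation 1. A minimal sketch, assuming the #test table from the script above still exists in the session and borrowing the 2 * SQRT(3) scale factor from the first answer:

```sql
DECLARE @Mean FLOAT = 0.0, @StDev FLOAT = 1.0;

-- Reseed RAND() per row from the row's own columns, then shift and scale:
-- (u - 0.5) centers the value on 0, and 2 * SQRT(3) stretches the
-- standard deviation from 1 / SQRT(12) to 1.
UPDATE #test
SET random = @Mean + (2 * SQRT(3) * @StDev) * (RAND(BINARY_CHECKSUM(Number, blah)) - 0.5);

-- Expect _avg close to 0 and _stdev close to 1
SELECT _avg = AVG(random), _stdev = STDEV(random) FROM #test;
```

Since blah was filled from NEWID(), these values repeat only while the table contents stay the same; seeding from stable columns would avoid that dependency.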
DECLARE @userReportId BIGINT
SET @userReportId = FLOOR(RAND()*(10000000000000-1) + 1);
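Answer 3 (score: 1)
Repeatable random numbers may be needed to re-run a scenario in which a test went wrong, in order to reproduce the failure.
The following suggestion fills a physical table (add indexes!) with positions and random numbers.
Use this list with a simple join to attach a random number to each row.
Every call will bind the same random number to a given row.
To change the assignment, re-position the randoms using new random positions (or truncate and refill, or drop and recreate the table).
This should be pretty fast...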
CREATE TABLE dbo.MyRepeatableRandoms(CurrentPosition BIGINT,RandomNumber BIGINT);
GO
DECLARE @CountOfNumbers INT=5; --set a fitting max count here
WITH Tally AS
(
SELECT TOP(@CountOfNumbers) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nr
FROM master..spt_values
CROSS JOIN master..spt_values X
CROSS JOIN master..spt_values Y
)
INSERT INTO dbo.MyRepeatableRandoms
SELECT Nr,CAST(CAST(NEWID() AS VARBINARY(8)) AS BIGINT) FROM Tally;
--Use this list with a simple join to bind it to the rows of your table
SELECT * FROM dbo.MyRepeatableRandoms ORDER BY CurrentPosition;
--Re-Position the list
WITH UpdateableCTE AS
(
SELECT ROW_NUMBER() OVER(ORDER BY A.posOrder) AS NewPos
,CurrentPosition
FROM dbo.MyRepeatableRandoms
CROSS APPLY(SELECT NEWID() AS posOrder) AS A
)
UPDATE UpdateableCTE SET CurrentPosition=NewPos;
--The same random numbers at new positions
SELECT * FROM MyRepeatableRandoms ORDER BY CurrentPosition;
GO
DROP TABLE dbo.MyRepeatableRandoms
Result
CurrentPosition  RandomNumber
1                -1939965404062448822
2                2786711671511266125
3                -3236707863137400753
4                -6029509773149087675
5                7815987559555455297
After re-positioning
CurrentPosition  RandomNumber
1                7815987559555455297
2                -1939965404062448822
3                2786711671511266125
4                -6029509773149087675
5                -3236707863137400753
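The "simple join" mentioned above might look like the following sketch. dbo.MyData and its dense integer key RowId (1..N, no gaps) are hypothetical names introduced only for illustration; a table without such a key would first need a stable ROW_NUMBER() over a unique column. The scaling to mean 0 and standard deviation 1 is likewise my addition and treats the stored BIGINT as approximately uniform over its full range. Note that the script above drops dbo.MyRepeatableRandoms at the end, so that last step has to be skipped for the join to work.

```sql
SELECT d.*,
       r.RandomNumber,
       -- BIGINT / 2^63 lies in [-1, 1]; multiplying by SQRT(3) gives a
       -- standard deviation of about 1 for an approximately uniform input
       CAST(r.RandomNumber AS FLOAT) / 9223372036854775808.0 * SQRT(3.0) AS UniformValue
FROM dbo.MyData AS d                      -- hypothetical table
JOIN dbo.MyRepeatableRandoms AS r
  ON r.CurrentPosition = d.RowId;         -- hypothetical dense key 1..N
```

As long as the randoms table is left unchanged, every run of this query pairs each row with the same value, which is the reproducibility the answer describes.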
Answer 4 (score: 0)
Here is a pure-SQL expression that comes very close:
select iif(rand(rand(id)) < .5, -1, 1) * sqrt(1 - exp(-1.27323954474*rand(id)*rand(id) *
(1 + 0.0586276296*rand(id)*rand(id)) / (1 + 0.0886745239*rand(id)*rand(id))))
from mytable
I chose the id column as the seed, but you can pick whichever column makes the most sense for you, i.e. change rand(id) to rand(some_other_column) as needed.