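How do you randomly select one table row in T-SQL based on a weight applied to all candidate rows?
For example, I have a set of rows in a table weighted at 50, 25, and 25 (these add up to 100, but they don't need to), and I want to select one of them at random so that the statistical outcome matches the respective weights.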
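Answer 0 (score: 15)
Dane's answer uses a self-join in a way that introduces a square law: for a table of n rows, the join produces roughly (n*n/2) rows.
It would be better to read the table only once.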
DECLARE @id int, @weight_sum int, @weight_point int
DECLARE @table TABLE (id int, weight int)
INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)
SELECT @weight_sum = SUM(weight)
FROM @table
-- Uniform integer in 0 .. @weight_sum - 1 (FLOOR takes a single argument)
SELECT @weight_point = FLOOR(RAND() * @weight_sum)
SELECT
@id = CASE WHEN @weight_point < 0 THEN @id ELSE [table].id END,
@weight_point = @weight_point - [table].weight
FROM
@table [table]
ORDER BY
[table].Weight DESC
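This walks through the table, setting @id to each record's id value while decrementing @weight_point. Eventually @weight_point goes negative, which means the SUM of all weights seen so far is greater than the randomly chosen target value. That is the record we want, so from that point on we set @id to itself (ignoring any further ids in the table).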
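This reads the table only once, but it has to read the whole table even if the chosen value is the first record. Because the chosen record sits, on average, halfway through the table (earlier when the rows are ordered by descending weight, as here), writing a loop could be faster, especially if many rows share the same weights: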
DECLARE @id int, @weight_sum int, @weight_point int, @next_weight int, @row_count int
DECLARE @table TABLE (id int, weight int)
INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)
SELECT @weight_sum = SUM(weight)
FROM @table
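-- Uniform integer in 1 .. @weight_sum
SELECT @weight_point = FLOOR(RAND() * @weight_sum) + 1
SELECT @next_weight = MAX(weight) FROM @table
-- Count only the rows in the heaviest weight group, not the whole table
SELECT @row_count = COUNT(*) FROM @table WHERE weight = @next_weight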
SET @weight_point = @weight_point - (@next_weight * @row_count)
WHILE (@weight_point > 0)
BEGIN
SELECT @next_weight = MAX(weight) FROM @table WHERE weight < @next_weight
SELECT @row_count = COUNT(*) FROM @table WHERE weight = @next_weight
SET @weight_point = @weight_point - (@next_weight * @row_count)
END
-- # Once @weight_point is 0 or less, we know that the randomly chosen record
-- # is in the group of records WHERE [table].weight = @next_weight
-- # Pick a uniform 0-based offset within that group (FLOOR takes a single argument)
SELECT @row_count = FLOOR(RAND() * @row_count)
SELECT
@id = CASE WHEN @row_count < 0 THEN @id ELSE [table].id END,
@row_count = @row_count - 1
FROM
@table [table]
WHERE
[table].weight = @next_weight
ORDER BY
[table].Weight DESC
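Answer 1 (score: 7)
You simply need to sum the weights of all candidate rows, choose a random point within that sum, and then select the record that corresponds to that point (each record incrementally carries an accumulating weight sum with it).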
DECLARE @id int, @weight_sum int, @weight_point int
DECLARE @table TABLE (id int, weight int)
INSERT INTO @table(id, weight) VALUES(1, 50)
INSERT INTO @table(id, weight) VALUES(2, 25)
INSERT INTO @table(id, weight) VALUES(3, 25)
SELECT @weight_sum = SUM(weight)
FROM @table
-- Uniform integer in 1 .. @weight_sum (ROUND would halve the odds of the two end values)
SELECT @weight_point = FLOOR(RAND() * @weight_sum) + 1
SELECT TOP 1 @id = t1.id
FROM @table t1, @table t2
WHERE t1.id >= t2.id
GROUP BY t1.id
HAVING SUM(t2.weight) >= @weight_point
ORDER BY t1.id
SELECT @id
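Answer 2 (score: 3)
The "incrementally carrying an accumulating [sic] weight sum" part is expensive if you have a lot of records. If you also already have a wide range of scores/weights (that is, the range is wide enough that most record weights are unique; 1-5 stars probably wouldn't cut it), you can do something like this to pick a weight value. I'm using VB.Net here to demonstrate, but this could just as easily be done in pure SQL: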
Function PickScore() As Double
    'Assume we have a database wrapper class instance called SQL and have seeded a PRNG already
    'Get count of scores in database
    Dim ScoreCount As Double = SQL.ExecuteScalar("SELECT COUNT(score) FROM [MyTable]")
    ' You could also approximate this with just the number of records in the table, which might be faster.

    'Random number between 0 and 1 with ScoreCount possible values
    Dim rand As Double = Random.GetNext(ScoreCount) / ScoreCount

    'Use the equation y = 1 - x^3 to skew results in favor of higher scores
    ' For x between 0 and 1, y is also between 0 and 1 with a strong bias towards 1
    rand = 1 - (rand * rand * rand)

    'Now we need to map the (0,1] range to [1, MaxScore].
    'Just find MaxScore and multiply by rand
    Dim MaxScore As UInteger = SQL.ExecuteScalar("SELECT MAX(score) FROM [MyTable]")
    Return MaxScore * rand
End Function
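Run this, and pick the record with the largest score that does not exceed the returned weight. If more than one record shares that score, pick one of them at random. The advantages here are that you don't have to maintain any sums, and you can tweak the probability equation to suit your tastes. But again, it works best with a wide distribution of scores.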
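As a rough illustration of the "pure SQL" remark, the same steps might look like the sketch below. The table variable @MyTable, its columns, and the sample rows are assumptions made only so the batch runs on its own; they are not from the answer.

```sql
-- Hypothetical sample data standing in for [MyTable]
DECLARE @MyTable TABLE (id int PRIMARY KEY, score int NOT NULL);
INSERT INTO @MyTable (id, score)
VALUES (1, 10), (2, 40), (3, 70), (4, 95), (5, 95);

DECLARE @score_count float, @max_score float, @x float, @target float;

SELECT @score_count = COUNT(score), @max_score = MAX(score)
FROM @MyTable;

-- x in [0, 1) with @score_count possible values
SET @x = FLOOR(RAND() * @score_count) / @score_count;

-- y = 1 - x^3 skews the result toward 1, i.e. toward higher scores
SET @target = @max_score * (1 - POWER(@x, 3));

-- Largest score not exceeding the target; NEWID() breaks ties at random
SELECT TOP (1) id, score
FROM @MyTable
WHERE score <= @target
ORDER BY score DESC, NEWID();
```

If the target falls below the lowest score in the table, the final query returns no row, so a caller would need to retry or fall back to the lowest score.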
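Answer 3 (score: 2)
The way to do this with a random number generator is to integrate the probability density function. With a set of discrete values, you can calculate the prefix sum (the sum of all values up to and including this one) and store it. With that in place, you select the row with the smallest prefix sum (the running aggregate) that is greater than the random number.
In a database, the stored values that follow an inserted row have to be updated. If the relative frequency of updates and the size of the data set don't make that cost prohibitive, the appropriate value can be obtained from a single sargable query (one whose predicate can be resolved by an index lookup).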
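A minimal sketch of the stored prefix-sum idea, assuming SQL Server 2012 or later for the windowed SUM. The table variable @weighted and the column cum_weight are hypothetical names, and the UNIQUE constraint assumes every weight is positive.

```sql
-- cum_weight stores the running (prefix) sum of weight in id order
DECLARE @weighted TABLE (
    id         int PRIMARY KEY,
    weight     int NOT NULL CHECK (weight > 0),
    cum_weight int NOT NULL UNIQUE   -- the unique index lets the lookup below seek
);

INSERT INTO @weighted (id, weight, cum_weight)
SELECT v.id, v.weight,
       SUM(v.weight) OVER (ORDER BY v.id ROWS UNBOUNDED PRECEDING)
FROM (VALUES (1, 50), (2, 25), (3, 25)) AS v(id, weight);

-- Uniform integer in 0 .. total weight - 1
DECLARE @point int;
SELECT @point = FLOOR(RAND() * MAX(cum_weight)) FROM @weighted;

-- Smallest prefix sum greater than the random point
SELECT TOP (1) id
FROM @weighted
WHERE cum_weight > @point
ORDER BY cum_weight;
```

With the sample weights, the stored prefix sums are 50, 75, and 100, so points 0-49 select id 1, 50-74 select id 2, and 75-99 select id 3.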
Answer 4 (score: 0)
If you need to draw a set of samples (say, 50 rows from a collection of 5M rows), where each row has a column called Weight that is an int and larger values mean more weight, you can use this query:
SELECT *
FROM
(
    SELECT TOP 50 RowData, Weight
    FROM MyTable
    ORDER BY POWER(RAND(CAST(NEWID() AS VARBINARY)), (1.0/Weight)) DESC
) X
ORDER BY Weight DESC
The key here is the use of the POWER() function, as illustrated here.
Alternatively, you can use this expression as the random value:
1.0 * ABS(CAST(CHECKSUM(NEWID()) AS bigint)) / CAST(0x7FFFFFFF AS INT)
The checksum is cast to BIGINT rather than INT because of this issue:
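Because CHECKSUM returns an int, and the range of an int is -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647), the ABS() function can return an overflow error if the result happens to be exactly -2,147,483,648! The chances are obviously very low, about 1 in 4 billion, but we were running it over a ~1.8b row table every day, so it was happening about once a week! The fix is to cast the checksum to bigint before the ABS.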
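The overflow described in the quote can be reproduced with a fixed value instead of waiting for CHECKSUM to produce it. This is only a sketch; the variable name @i is arbitrary.

```sql
-- The one int value whose absolute value does not fit in an int
DECLARE @i int = -2147483648;

-- Uncommenting the next line raises an arithmetic overflow error
-- SELECT ABS(@i);

-- Casting to bigint first gives ABS room for the result
SELECT ABS(CAST(@i AS bigint)) AS safe_abs;   -- returns 2147483648
```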