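I have a set of objects (5000+) with 7 different properties. Two of the properties are ternary; the rest are binary. Every object has all 7 properties specified. In some cases a binary property may become unary.
Occasionally I need to select the top N random objects from this set, weighted by the frequency of the tags in each category relative to the total number of objects.
Currently I have all the data in a SQL Server table as (object, propertyMask) pairs; however, I can reorganize it in any other way necessary.
Example:
The data is: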
object1|9 <- 1001 black circle only (all other properties are 0)
object2|81 <- 101 0001 black square with solid color (all other properties are 0)
object3|148 <- 1001 0100 yellow square with dashed contour
etc.
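As an illustration, the three sample masks can be decoded with a small Python sketch. The bit meanings in `FLAGS` are inferred from the sample rows and are an assumption; the post does not spell out the full layout, and the remaining bits are left unnamed.

```python
# Hypothetical bit layout, inferred only from the three sample rows above.
# Bits that do not appear in the samples (for example 2 and 32) are omitted.
FLAGS = {
    1: "black",
    4: "yellow",
    8: "circle",
    16: "square",
    64: "solid color",
    128: "dashed contour",
}


def decode_mask(mask: int) -> list[str]:
    """Return the names of the known flags set in mask, lowest bit first."""
    return [name for bit, name in FLAGS.items() if mask & bit]


for name, mask in [("object1", 9), ("object2", 81), ("object3", 148)]:
    # format(mask, "08b") shows the mask as an 8-digit binary string.
    print(name, format(mask, "08b"), decode_mask(mask))
```

Under that assumed layout, 9 decodes to black and circle, 81 to black, square and solid color, and 148 to yellow, square and dashed contour, which matches the descriptions in the sample data.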
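Say I end up with 1k objects: 600 black, 300 yellow and 100 blue. I need to select the top 10 objects. If I considered just one property, I would simply take any 6 black, 3 yellow and 1 blue objects. But I also have 6 other properties to consider, and I need to make sure I get the proper number of circles, squares and triangles, and so on. At this point I don't even know how to approach this.
Any suggestions would be greatly appreciated.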
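The single-property case described above comes down to simple proportional arithmetic. A minimal Python sketch, where the function name `quotas` and the plain rounding rule are assumptions made for illustration:

```python
from collections import Counter


def quotas(values, n):
    """Target count per distinct value for a sample of size n.

    Each value gets its share of the population scaled to n and rounded,
    so the quotas are not guaranteed to add up to exactly n.
    """
    counts = Counter(values)
    total = len(values)
    return {value: round(count * n / total) for value, count in counts.items()}


# The color counts from the example: 600 black, 300 yellow, 100 blue.
colors = ["black"] * 600 + ["yellow"] * 300 + ["blue"] * 100
print(quotas(colors, 10))  # {'black': 6, 'yellow': 3, 'blue': 1}
```

The hard part of the question is that the same calculation gives one set of quotas per property, and a single selection of objects has to satisfy all of them at once.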
*Edit:
I repopulated the data in the following format:
name | att1 | att2 | ...
obj1 | 1 | 8 | ...
obj2 | 2 | 16 | ...
obj3 | 1 | 32 | ...
Is there a way to select the TOP N objects weighted by the frequency of each attribute? Each object has 7 attributes; there are no nulls.
Thanks!
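Answer 0 (score: 1)
It's messy, and it doesn't always return the exact number of rows required or a perfect distribution, but it gets very close.
Here is how it works: the rows are numbered in random order, a target count is computed for each attribute value in proportion to its frequency, and a recursive CTE then walks the rows one at a time, keeping a row only if it would not push any count past its target.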
DECLARE @TargetRowNum INT = 100;

-- One row per object: a random row number plus a 0/1 flag per attribute value.
-- Caveat: SQL Server may re-evaluate a CTE for each reference, so the NEWID()
-- ordering is not guaranteed to be stable; materializing this step into a
-- temp table first is a safer variant.
WITH ValuesPivotted AS (
    SELECT O.id
         , RowNum = ROW_NUMBER() OVER (ORDER BY NEWID())
         , [0]   = CASE WHEN O.atr1 = 0   THEN 1 ELSE 0 END
         , [1]   = CASE WHEN O.atr1 = 1   THEN 1 ELSE 0 END
         , [2]   = CASE WHEN O.atr1 = 2   THEN 1 ELSE 0 END
         , [4]   = CASE WHEN O.atr2 = 4   THEN 1 ELSE 0 END
         , [8]   = CASE WHEN O.atr2 = 8   THEN 1 ELSE 0 END
         , [16]  = CASE WHEN O.atr3 = 16  THEN 1 ELSE 0 END
         , [32]  = CASE WHEN O.atr3 = 32  THEN 1 ELSE 0 END
         , [64]  = CASE WHEN O.atr4 = 64  THEN 1 ELSE 0 END
         , [128] = CASE WHEN O.atr4 = 128 THEN 1 ELSE 0 END
    FROM dbo.objects AS O
),
-- Target count per attribute value: its share of all rows scaled to
-- @TargetRowNum. Because of rounding, the targets for one attribute need not
-- add up to exactly @TargetRowNum.
TargetDistribution AS (
    SELECT Target0   = ROUND(CAST(SUM([0])   AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target1   = ROUND(CAST(SUM([1])   AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target2   = ROUND(CAST(SUM([2])   AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target4   = ROUND(CAST(SUM([4])   AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target8   = ROUND(CAST(SUM([8])   AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target16  = ROUND(CAST(SUM([16])  AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target32  = ROUND(CAST(SUM([32])  AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target64  = ROUND(CAST(SUM([64])  AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target128 = ROUND(CAST(SUM([128]) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
    FROM ValuesPivotted
),
-- Walk the shuffled rows one at a time, carrying a running count per value.
SelectRows AS (
    -- Anchor: the first shuffled row is always kept.
    SELECT VP.id
         , RowNum
         , KeepRow = 1
         , Target0   , Sum0   = [0]
         , Target1   , Sum1   = [1]
         , Target2   , Sum2   = [2]
         , Target4   , Sum4   = [4]
         , Target8   , Sum8   = [8]
         , Target16  , Sum16  = [16]
         , Target32  , Sum32  = [32]
         , Target64  , Sum64  = [64]
         , Target128 , Sum128 = [128]
    FROM ValuesPivotted AS VP
    CROSS JOIN TargetDistribution AS TD
    WHERE VP.RowNum = 1

    UNION ALL

    -- Recursive step: take the next row and count it only if it fits.
    SELECT VP.id
         , VP.RowNum
         , KeepRow = ISNULL(SkipRow.Value, 1)
         , Target0   , Sum0   = Sum0   + ISNULL(SkipRow.Value, [0])
         , Target1   , Sum1   = Sum1   + ISNULL(SkipRow.Value, [1])
         , Target2   , Sum2   = Sum2   + ISNULL(SkipRow.Value, [2])
         , Target4   , Sum4   = Sum4   + ISNULL(SkipRow.Value, [4])
         , Target8   , Sum8   = Sum8   + ISNULL(SkipRow.Value, [8])
         , Target16  , Sum16  = Sum16  + ISNULL(SkipRow.Value, [16])
         , Target32  , Sum32  = Sum32  + ISNULL(SkipRow.Value, [32])
         , Target64  , Sum64  = Sum64  + ISNULL(SkipRow.Value, [64])
         , Target128 , Sum128 = Sum128 + ISNULL(SkipRow.Value, [128])
    FROM SelectRows AS SR
    INNER JOIN ValuesPivotted AS VP
        ON VP.RowNum = SR.RowNum + 1
    CROSS APPLY (
        -- NULL = keep the row (no target would be exceeded); 0 = skip it.
        SELECT Value =
            CASE WHEN Sum0   + [0]   <= Target0
                  AND Sum1   + [1]   <= Target1
                  AND Sum2   + [2]   <= Target2
                  AND Sum4   + [4]   <= Target4
                  AND Sum8   + [8]   <= Target8
                  AND Sum16  + [16]  <= Target16
                  AND Sum32  + [32]  <= Target32
                  AND Sum64  + [64]  <= Target64
                  AND Sum128 + [128] <= Target128
                 THEN NULL ELSE 0 END
    ) AS SkipRow
    -- Keep recursing while at least one target is still unmet.
    WHERE Sum0   < Target0
       OR Sum1   < Target1
       OR Sum2   < Target2
       OR Sum4   < Target4
       OR Sum8   < Target8
       OR Sum16  < Target16
       OR Sum32  < Target32
       OR Sum64  < Target64
       OR Sum128 < Target128
)
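-- Return the full rows for every object that was kept.
SELECT O.*
FROM SelectRows AS SR
INNER JOIN dbo.objects AS O
    ON SR.id = O.id
WHERE SR.KeepRow = 1
OPTION (MAXRECURSION 0)  -- lift the default limit of 100 recursion levels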
Edit: the WHERE clause in SelectRows was not doing what it was meant to do, which is to stop the recursion once all targets are met. Now it does.
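For readers who find the recursive CTE hard to follow, the same greedy walk can be sketched in a few lines of Python. This is only an illustration of the idea, not a translation of the query: the function name `stratified_sample`, the `(id, attribute tuple)` row format and the `seed` parameter are assumptions, and unlike the SQL it does not force the first shuffled row to be kept.

```python
import random
from collections import Counter


def stratified_sample(rows, n, seed=None):
    """Greedy quota sampling over shuffled rows.

    rows is a list of (id, attrs) pairs, where attrs is a tuple holding one
    value per attribute. May return fewer than n ids, like the SQL version.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)  # plays the role of ORDER BY NEWID()

    # Target per (attribute position, value): population share scaled to n.
    freq = Counter((i, v) for _, attrs in rows for i, v in enumerate(attrs))
    target = {key: round(count * n / len(rows)) for key, count in freq.items()}

    taken = Counter()  # running counts, like Sum0, Sum1, ... in the query
    sample = []
    for obj_id, attrs in shuffled:
        keys = list(enumerate(attrs))
        # Keep the row only if it pushes no count past its target.
        if all(taken[key] + 1 <= target[key] for key in keys):
            sample.append(obj_id)
            taken.update(keys)
        # Stop once every target is met, like the WHERE in SelectRows.
        if all(taken[key] >= target[key] for key in target):
            break
    return sample


# Hypothetical data: a ternary attribute (0/1/2) and a binary one (4/8).
rows = [(i, (i % 3, 4 if i % 2 else 8)) for i in range(600)]
picked = stratified_sample(rows, 12, seed=1)
print(len(picked))  # 12
```

With this evenly spread test data every quota can be met exactly. On real data, where some attribute combinations are rare, the walk can run out of rows before all targets are reached, which is the same "very close but not exact" behaviour described for the query.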