Select unique members of a set from weighted categories/attributes

Time: 2015-02-26 23:18:58

Tags: sql sql-server-2008 tsql

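I have a set of objects (5000+) with 7 different properties. Two of the properties are ternary and the rest are binary. Every object has all 7 properties assigned. In some cases a binary property may become unary.

Occasionally I need to select the top N random objects from this set, weighted by the frequency of the tags in each category relative to the total number of objects.

Currently I have all the data in a SQL Server table as (object, propertyMask) pairs; however, I can reorganize it in any other way if necessary.

Example: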

  1. black, blue, yellow (1, 2, 4)
  2. circle, square, triangle (8, 16, 32)
  3. solid color / mesh color (64)
  4. dashed contour / no contour (128)
  5. etc. (256)

The data is:

    object1|9   <-      1001 black circle only (all other properties are 0)
    object2|81  <-  101 0001 black square with solid color (all other properties are 0)
    object3|148 <- 1001 0100 yellow square with dashed contour
    etc.
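
As a rough illustration of how such masks decode, the bitwise `&` operator can isolate each category's bits. This is only a sketch: the table variable `@objects` and the output column names are hypothetical, and the bit groups (7 for colors, 56 for shapes) follow the numbering in the list above.

```sql
-- Hypothetical sample rows mirroring the (object, propertyMask) pairs above.
DECLARE @objects TABLE (name VARCHAR(20), propertyMask INT);

INSERT INTO @objects (name, propertyMask)
VALUES ('object1', 9), ('object2', 81), ('object3', 148);

-- Mask each category's bits: 1+2+4 = 7 for color, 8+16+32 = 56 for shape.
SELECT name
     , color  = propertyMask & 7     -- 1 = black, 2 = blue, 4 = yellow
     , shape  = propertyMask & 56    -- 8 = circle, 16 = square, 32 = triangle
     , solid  = propertyMask & 64    -- 64 when the solid-color bit is set, else 0
     , dashed = propertyMask & 128   -- 128 when the dashed-contour bit is set, else 0
FROM @objects;
```

For object3 (148 = 128 + 16 + 4) this yields color 4, shape 16, solid 0 and dashed 128, which matches the annotation in the sample data.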
    

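Say I end up with 1k objects: 600 black, 300 yellow and 100 blue. I need to select the top 10 objects. If I were considering only one property, I would simply take any 6 black, 3 yellow and 1 blue objects. But I have 6 other properties to take into account, and I need to make sure I also get the right number of circles, squares and triangles, and so on. At this point I don't even know how to approach the problem.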
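
The single-property case described above amounts to scaling each value's share of the population by N. A minimal sketch, assuming a hypothetical table variable `@counts` that holds the per-color totals from the example:

```sql
DECLARE @N INT = 10;

-- Hypothetical per-color totals taken from the example (600 / 300 / 100).
DECLARE @counts TABLE (color VARCHAR(10), cnt INT);

INSERT INTO @counts (color, cnt)
VALUES ('black', 600), ('yellow', 300), ('blue', 100);

-- quota = share of the population * N, rounded to a whole number of rows.
SELECT color
     , quota = ROUND(CAST(cnt AS FLOAT) / SUM(cnt) OVER () * @N, 0)
FROM @counts;
```

This gives the 6 / 3 / 1 split for one property; the difficulty in the question is satisfying such quotas for all 7 properties at once.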

Any suggestions would be greatly appreciated.

Edit:

I repopulated the data in the following format:

    name | att1 | att2 | ...
    obj1 |  1   |   8  | ...
    obj2 |  2   |   16 | ...
    obj3 |  1   |   32 | ...  
    

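Is there a way to select the TOP N objects weighted by the frequency of each attribute? Every object has 7 attributes; there are no nulls.

Thanks!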

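1 Answer:

Answer 0: (score: 1)

It is messy, and it does not always fetch the exact number of rows wanted or a perfect distribution, but it comes pretty close.

So how does it work:

  • ValuesPivotted: pivots all the distinct values and gives every row a random row number.
  • TargetDistribution: determines, for each distinct value, how many rows you need.
  • SelectRows: walks through ValuesPivotted one row at a time and checks whether the row has to be skipped because keeping it would exceed the target for one of its values. Otherwise, it increments the Sum for every value that applies to that row.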

-- Number of rows to sample (the N in "top N").
DECLARE @TargetRowNum INT = 100;

WITH ValuesPivotted AS(
    SELECT O.id
         , RowNum = ROW_NUMBER() OVER (ORDER BY NEWID())
         , [0] =   CASE WHEN O.atr1 = 0   THEN 1 ELSE 0 END
         , [1] =   CASE WHEN O.atr1 = 1   THEN 1 ELSE 0 END
         , [2] =   CASE WHEN O.atr1 = 2   THEN 1 ELSE 0 END
         , [4] =   CASE WHEN O.atr2 = 4   THEN 1 ELSE 0 END
         , [8] =   CASE WHEN O.atr2 = 8   THEN 1 ELSE 0 END
         , [16] =  CASE WHEN O.atr3 = 16  THEN 1 ELSE 0 END
         , [32] =  CASE WHEN O.atr3 = 32  THEN 1 ELSE 0 END
         , [64] =  CASE WHEN O.atr4 = 64  THEN 1 ELSE 0 END
         , [128] = CASE WHEN O.atr4 = 128 THEN 1 ELSE 0 END
    FROM dbo.objects AS O
),
-- Per-value quota: the value's share of all rows times @TargetRowNum, rounded.
TargetDistribution AS (
    SELECT Target0   = ROUND(CAST(SUM([0]  ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target1   = ROUND(CAST(SUM([1]  ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target2   = ROUND(CAST(SUM([2]  ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target4   = ROUND(CAST(SUM([4]  ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target8   = ROUND(CAST(SUM([8]  ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target16  = ROUND(CAST(SUM([16] ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target32  = ROUND(CAST(SUM([32] ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target64  = ROUND(CAST(SUM([64] ) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
         , Target128 = ROUND(CAST(SUM([128]) AS FLOAT) / COUNT(*) * @TargetRowNum, 0)
    FROM ValuesPivotted
),
-- Recursive CTE: the anchor always keeps row 1, then rows are visited in RowNum order.
SelectRows AS(
    SELECT VP.id
         , RowNum
         , KeepRow = 1
         , Target0    , Sum0   = [0]
         , Target1    , Sum1   = [1]
         , Target2    , Sum2   = [2]
         , Target4    , Sum4   = [4]
         , Target8    , Sum8   = [8]
         , Target16   , Sum16  = [16]
         , Target32   , Sum32  = [32]
         , Target64   , Sum64  = [64]
         , Target128  , Sum128 = [128]
    FROM ValuesPivotted AS VP
        CROSS JOIN TargetDistribution AS TD
    WHERE VP.RowNum = 1

    UNION ALL

    SELECT 
           VP.id
         , VP.RowNum
         , KeepRow = ISNULL(SkipRow.Value, 1)
         , Target0    , Sum0   = Sum0   + ISNULL(SkipRow.Value, [0]  )
         , Target1    , Sum1   = Sum1   + ISNULL(SkipRow.Value, [1]  )
         , Target2    , Sum2   = Sum2   + ISNULL(SkipRow.Value, [2]  )
         , Target4    , Sum4   = Sum4   + ISNULL(SkipRow.Value, [4]  )
         , Target8    , Sum8   = Sum8   + ISNULL(SkipRow.Value, [8]  )
         , Target16   , Sum16  = Sum16  + ISNULL(SkipRow.Value, [16] )
         , Target32   , Sum32  = Sum32  + ISNULL(SkipRow.Value, [32] )
         , Target64   , Sum64  = Sum64  + ISNULL(SkipRow.Value, [64] )
         , Target128  , Sum128 = Sum128 + ISNULL(SkipRow.Value, [128])
    FROM SelectRows AS SR
        INNER JOIN ValuesPivotted AS VP
            ON VP.RowNum = SR.RowNum + 1
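        -- SkipRow.Value is NULL when the row fits within every target (keep it),
        -- and 0 when keeping it would exceed at least one target (skip it).
        CROSS APPLY(
            SELECT  Value = 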
                CASE WHEN   Sum0   + [0]   <= Target0 
                        AND Sum1   + [1]   <= Target1 
                        AND Sum2   + [2]   <= Target2 
                        AND Sum4   + [4]   <= Target4 
                        AND Sum8   + [8]   <= Target8 
                        AND Sum16  + [16]  <= Target16
                        AND Sum32  + [32]  <= Target32
                        AND Sum64  + [64]  <= Target64
                        AND Sum128 + [128] <= Target128
                    THEN NULL ELSE 0 END
        ) AS SkipRow
    -- Keep recursing only while at least one target is still unmet.
    WHERE  Sum0   < Target0 
        OR Sum1   < Target1 
        OR Sum2   < Target2 
        OR Sum4   < Target4 
        OR Sum8   < Target8 
        OR Sum16  < Target16
        OR Sum32  < Target32
        OR Sum64  < Target64
        OR Sum128 < Target128
)
SELECT O.*
FROM SelectRows AS SR
    INNER JOIN dbo.objects AS O
        ON SR.id = O.id
WHERE SR.KeepRow = 1
OPTION(MAXRECURSION 0)

Edit: The WHERE clause in SelectRows did not do what it was meant to do, which is to stop the recursion once all targets have been met; now it does.
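
Since the query is described as only approximately hitting its targets, it may be worth comparing the sampled distribution with the population afterwards. The following is a sketch for a single attribute, not part of the original answer: `@population` and `@picked` are hypothetical stand-ins (with made-up rows) for `dbo.objects` and for the rows the query returned, and `atr1` is used as the example column.

```sql
-- Hypothetical stand-ins: @population for dbo.objects,
-- @picked for the rows returned by the sampling query.
DECLARE @population TABLE (id INT, atr1 INT);
DECLARE @picked     TABLE (id INT, atr1 INT);

INSERT INTO @population (id, atr1)
VALUES (1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 2);

INSERT INTO @picked (id, atr1)
VALUES (1, 0), (2, 0), (4, 1);

-- Share of each atr1 value in the population versus in the sample.
SELECT src.atr1
     , population_share = CAST(src.cnt AS FLOAT) / SUM(src.cnt) OVER ()
     , sample_share     = CAST(ISNULL(smp.cnt, 0) AS FLOAT)
                          / NULLIF(SUM(ISNULL(smp.cnt, 0)) OVER (), 0)
FROM (SELECT atr1, cnt = COUNT(*) FROM @population GROUP BY atr1) AS src
    LEFT JOIN (SELECT atr1, cnt = COUNT(*) FROM @picked GROUP BY atr1) AS smp
        ON smp.atr1 = src.atr1
ORDER BY src.atr1;
```

Values that never made it into the sample show a sample_share of 0, which makes large deviations from the targets easy to spot. The same comparison can be repeated per attribute column.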