Academic puzzle: deriving proportions without a self-join

Time: 2018-05-09 10:00:42

Tags: sql hive hiveql

We have data arriving in the following structure:

entity_id   entity_value   category_id   category_weight   group_id   group_weight
    1            100            11               6            101          4
    1            100            11               6            102          3
    1            100            12               5            102          3
    1            100            12               5            103          2
    1            100            13               6            101          4

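An entity can belong to any categories and any groups, in any combination; there is no implied relationship between category membership and group membership.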

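The data is redundant but consistent; if one row says category 11 has a weight of 6, every row will say category 11 has a weight of 6. The same applies to groups and their weights.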

A row of data is uniquely identified by {entity_id, category_id, group_id}.
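The two guarantees above (unique row key, consistent weights) can be checked mechanically. The following is only a minimal sketch in Python, assuming the rows are available as plain tuples in the column order of the table; `ROWS` and `check_invariants` are hypothetical names used for illustration.

```python
# Sample rows copied from the table above, in column order:
# (entity_id, entity_value, category_id, category_weight, group_id, group_weight)
ROWS = [
    (1, 100, 11, 6, 101, 4),
    (1, 100, 11, 6, 102, 3),
    (1, 100, 12, 5, 102, 3),
    (1, 100, 12, 5, 103, 2),
    (1, 100, 13, 6, 101, 4),
]


def check_invariants(rows):
    """Return True if row keys are unique and every weight is consistent."""
    seen_keys = set()
    category_weights = {}  # category_id -> first weight seen
    group_weights = {}     # group_id -> first weight seen
    for entity_id, _value, cat_id, cat_w, grp_id, grp_w in rows:
        key = (entity_id, cat_id, grp_id)
        if key in seen_keys:
            return False  # the same {entity, category, group} appears twice
        seen_keys.add(key)
        # setdefault stores the first weight and returns the stored one afterwards
        if category_weights.setdefault(cat_id, cat_w) != cat_w:
            return False  # a category is listed with two different weights
        if group_weights.setdefault(grp_id, grp_w) != grp_w:
            return False  # a group is listed with two different weights
    return True


print(check_invariants(ROWS))
```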


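The aim is to apportion the entity's value across all of its rows according to the various weights: first apportion by category, then apportion by group.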


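Step 1: Apportion by category

Entity 1 is associated with 3 categories {11, 12, 13}, with weights {6, 5, 6}

    Apportion 100 * (6 / (6 + 5 + 6)) to category 11 => 35.29
    Apportion 100 * (5 / (6 + 5 + 6)) to category 12 => 29.41
    Apportion 100 * (6 / (6 + 5 + 6)) to category 13 => 35.29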

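Step 2: Apportion those results by group

Entity 1, category 11 is associated with groups {101, 102}, with weights {4, 3}

    Apportion 35.29 * (4 / (4 + 3)) to group 101 => 20.17
    Apportion 35.29 * (3 / (4 + 3)) to group 102 => 15.12

Entity 1, category 12 is associated with groups {102, 103}, with weights {3, 2}

    Apportion 29.41 * (3 / (3 + 2)) to group 102 => 17.65
    Apportion 29.41 * (2 / (3 + 2)) to group 103 => 11.76

Entity 1, category 13 is associated with group {101}, with weight {4}

    Apportion 35.29 * (4 / (4)) to group 101 => 35.29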
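The two steps can be reproduced outside SQL to sanity-check any query against them. This is a minimal Python sketch, not the SQL being asked for; it assumes the sample rows are held as tuples in table column order, and `ROWS` and `apportion` are hypothetical names.

```python
from collections import defaultdict

# (entity_id, entity_value, category_id, category_weight, group_id, group_weight)
ROWS = [
    (1, 100, 11, 6, 101, 4),
    (1, 100, 11, 6, 102, 3),
    (1, 100, 12, 5, 102, 3),
    (1, 100, 12, 5, 103, 2),
    (1, 100, 13, 6, 101, 4),
]


def apportion(rows):
    """Map (entity_id, category_id, group_id) to its share of entity_value."""
    category_weight = {}             # one weight per distinct (entity, category)
    group_total = defaultdict(float)  # sum of group weights per (entity, category)
    for entity, _value, cat, cat_w, _grp, grp_w in rows:
        category_weight[(entity, cat)] = cat_w
        group_total[(entity, cat)] += grp_w

    # Step 1 denominator: each category counted once per entity
    category_total = defaultdict(float)
    for (entity, _cat), cat_w in category_weight.items():
        category_total[entity] += cat_w

    shares = {}
    for entity, value, cat, cat_w, grp, grp_w in rows:
        by_category = value * cat_w / category_total[entity]                   # step 1
        shares[(entity, cat, grp)] = by_category * grp_w / group_total[(entity, cat)]  # step 2
    return shares


for key, share in sorted(apportion(ROWS).items()):
    print(key, round(share, 2))
```

The shares for an entity add back up to its original value (100 here). Because the sketch carries full precision between the steps, group 102 of category 11 comes out as 15.13, whereas starting from the already rounded 35.29 gives 15.12.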

  


I can do step 2 with window functions. Clean and tidy, no self-join.

Step 1, however, seems to need a sub-query and a self-join.

For example... http://sqlfiddle.com/#!18/be890/1

SELECT
  sample.entity_id,
  sample.category_id,
  sample.group_id,
  sample.entity_value   AS original_value,
  sample.entity_value
  * (sample.category_weight / entity.total_category_weight)
  * (sample.group_weight    / SUM(sample.group_weight) OVER (PARTITION BY sample.entity_id, sample.category_id))
    AS apportioned_value
FROM
(
  SELECT
    entity_id,
    SUM(category_weight)   AS total_category_weight
  FROM
  (
    SELECT
      entity_id,
      category_id,
      -- Weights are consistent per category, so MAX just picks that single value
      MAX(category_weight)   AS category_weight
    FROM
      sample
    GROUP BY
      entity_id,
      category_id
  )
    entity_category
  GROUP BY
    entity_id
)
  entity
INNER JOIN
  sample
    ON sample.entity_id = entity.entity_id

Is there a tidier way to do this, without the self-join?

1 Answer:

Answer 0: (score: 0)

SELECT
  entity_id,
  category_id,
  group_id,
  entity_value   AS original_value,
  entity_value
  * (category_weight / SUM(scaled_cat_weight) OVER (PARTITION BY entity_id             ))
  * (group_weight    / SUM(group_weight     ) OVER (PARTITION BY entity_id, category_id))
    AS apportioned_value
FROM
(
  SELECT
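    *,
    -- Spread each category's weight evenly over the rows that repeat it, so that
    -- summing scaled_cat_weight per entity counts every category exactly once
    category_weight / COUNT(*) OVER (PARTITION BY entity_id, category_id)   AS scaled_cat_weight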
  FROM
    sample
)
  scaled
ORDER BY
  entity_id,
  category_id,
  group_id
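The query above appears to rest on one observation: a category's weight is repeated once per row of that category, so dividing it by the row count of its (entity_id, category_id) partition and then summing over the whole entity yields the same total as summing the distinct category weights. The following Python sketch only illustrates that arithmetic on the sample rows; `ROWS`, `scaled_category_total` and `distinct_category_total` are hypothetical names, not part of the answer.

```python
from collections import Counter

# (entity_id, entity_value, category_id, category_weight, group_id, group_weight)
ROWS = [
    (1, 100, 11, 6, 101, 4),
    (1, 100, 11, 6, 102, 3),
    (1, 100, 12, 5, 102, 3),
    (1, 100, 12, 5, 103, 2),
    (1, 100, 13, 6, 101, 4),
]


def scaled_category_total(rows, entity_id):
    """Mimic SUM(scaled_cat_weight) OVER (PARTITION BY entity_id)."""
    # Equivalent of COUNT(*) OVER (PARTITION BY entity_id, category_id)
    rows_per_category = Counter((e, c) for e, _v, c, _cw, _g, _gw in rows)
    return sum(
        cat_w / rows_per_category[(e, c)]  # true division, like Hive's `/`
        for e, _v, c, cat_w, _g, _gw in rows
        if e == entity_id
    )


def distinct_category_total(rows, entity_id):
    """Mimic the nested sub-queries in the question: each category once."""
    distinct = {(e, c): cat_w for e, _v, c, cat_w, _g, _gw in rows if e == entity_id}
    return sum(distinct.values())


print(scaled_category_total(ROWS, 1), distinct_category_total(ROWS, 1))
```

Both totals are 17 for entity 1 (3 + 3 + 2.5 + 2.5 + 6). The sketch relies on true division; on an engine where dividing two integers truncates, 5 / 2 would become 2 and the scaled total 16, so the weights would probably need casting to a decimal or floating-point type first.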