Summing over distinct duplicated rows

Time: 2019-10-08 16:49:12

Tags: sum duplicates db2 olap partition

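TL;DR: Three tables in a hierarchical relationship, with a numeric field at the middle level. I need the sum of that number without the duplication caused by the lower level, and I'm looking for an alternative that uses the OLAP functions in DB2.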

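This partly revisits two existing topics ("SUM(DISTINCT) Based on Other Columns" and "Sum Values based on Distinct Guids"), but I'm posting it as a separate topic because I'd like to know whether there is a way to accomplish this with OLAP functions.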

I'm working in DB2. The schema (not the actual tables, because of client confidentiality) is:

   Table: NEIGHBORHOOD, field NEIGHBORHOOD_NAME
   Table: HOUSEHOLD, fields NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, and HOUSEHOLD_INCOME
   Table: HOUSEHOLD_MEMBER, fields HOUSEHOLD_NAME, PERSON_NAME

Right now we get the data through a single flatten-it-all view, so we get

 Shady Acres, 123 Shady Lane, 25000, Jane
 Shady Acres, 123 Shady Lane, 25000, Mary
 Shady Acres, 123 Shady Lane, 25000, Robert
 Shady Acres, 126 Shady Lane, 15000, George
 Shady Acres, 126 Shady Lane, 15000, Tom
 Shady Acres, 126 Shady Lane, 15000, Betsy
 Shady Acres, 126 Shady Lane, 15000, Timmy

If I want

    Shady Acres, 123 Shady Lane, 25000, 3  (household income, count of members)
    Shady Acres, 126 Shady Lane, 15000, 4

no problem:

SELECT N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME, count(1)
from NEIGHBORHOOD N join HOUSEHOLD H on N.NEIGHBORHOOD_NAME = H.NEIGHBORHOOD_NAME
join HOUSEHOLD_MEMBER M on H.HOUSEHOLD_NAME = M.HOUSEHOLD_NAME
group by N.NEIGHBORHOOD_NAME, H.HOUSEHOLD_NAME, H.HOUSEHOLD_INCOME

However, if I want

   Shady Acres, 2, 40000, 7 (i.e. neighborhood, number of households, sum of income, count of members)

As the linked questions show, I can't do it without a subquery.

The best I have so far is

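select N.NEIGHBORHOOD_NAME,
count(distinct H.HOUSEHOLD_NAME) household_count,
sum(distinct H.HOUSEHOLD_INCOME) total_income,
count(1) household_members
from NEIGHBORHOOD N join HOUSEHOLD H on N.NEIGHBORHOOD_NAME = H.NEIGHBORHOOD_NAME
join HOUSEHOLD_MEMBER M on H.HOUSEHOLD_NAME = M.HOUSEHOLD_NAME
group by N.NEIGHBORHOOD_NAME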

Of course, this breaks if two households have the same income. Frankly, I'm surprised that "sum(distinct)" is even valid, because it makes no sense to me.
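The failure mode is easy to reproduce. The sketch below is not DB2: it uses Python's built-in sqlite3 module as a stand-in, and the table name flat_view and its lowercase column names are hypothetical. Both households are given an income of 25000 so that SUM(DISTINCT ...) collapses them into one value.

```python
import sqlite3

# SQLite in memory stands in for DB2 here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flat_view (neighborhood_name TEXT, household_name TEXT,"
    " household_income INTEGER, person_name TEXT)"
)
# Same shape as the flattened view, but both households earn 25000.
conn.executemany(
    "INSERT INTO flat_view VALUES (?, ?, ?, ?)",
    [
        ("Shady Acres", "123 Shady Lane", 25000, "Jane"),
        ("Shady Acres", "123 Shady Lane", 25000, "Mary"),
        ("Shady Acres", "123 Shady Lane", 25000, "Robert"),
        ("Shady Acres", "126 Shady Lane", 25000, "George"),
        ("Shady Acres", "126 Shady Lane", 25000, "Tom"),
        ("Shady Acres", "126 Shady Lane", 25000, "Betsy"),
        ("Shady Acres", "126 Shady Lane", 25000, "Timmy"),
    ],
)
result = conn.execute(
    """
    SELECT neighborhood_name,
           COUNT(DISTINCT household_name) AS household_count,
           SUM(DISTINCT household_income) AS total_income,
           COUNT(*)                       AS household_members
    FROM flat_view
    GROUP BY neighborhood_name
    """
).fetchall()
print(result)
```

The household and member counts come out right (2 and 7), but the income total is 25000 rather than 50000, because DISTINCT is applied to the income values and not to the households.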

I tried

sum(household_income) over (partition by household.household_name) 

and it throws an error:

  

An expression starting with "HOUSEHOLD_INCOME" specified in a SELECT clause, HAVING clause, or ORDER BY clause is not specified in the GROUP BY clause or it is in a SELECT clause, HAVING clause, or ORDER BY clause with a column function and no GROUP BY clause is specified.. SQLCODE=-119, SQLSTATE=42803, DRIVER=4.19.56

Adding HOUSEHOLD_INCOME or HOUSEHOLD_NAME to the GROUP BY gives the wrong result, because we don't want the totals broken down by those fields.

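It's entirely possible that there is no solution other than a subquery, but that would force a significant redesign of the underlying views (including adding more views), so I figured it couldn't hurt to ask.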
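For reference, the subquery route mentioned above could look roughly like the following sketch: count the members per household in a derived table first, then aggregate once per neighborhood so that each household's income is added exactly once. This is an illustration under assumptions, not DB2 code: it runs against SQLite through Python's sqlite3 module, and the alias member_count is hypothetical.

```python
import sqlite3

# SQLite in memory stands in for DB2 here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE NEIGHBORHOOD (NEIGHBORHOOD_NAME TEXT);
    CREATE TABLE HOUSEHOLD (NEIGHBORHOOD_NAME TEXT, HOUSEHOLD_NAME TEXT,
                            HOUSEHOLD_INCOME INTEGER);
    CREATE TABLE HOUSEHOLD_MEMBER (HOUSEHOLD_NAME TEXT, PERSON_NAME TEXT);
    INSERT INTO NEIGHBORHOOD VALUES ('Shady Acres');
    INSERT INTO HOUSEHOLD VALUES
        ('Shady Acres', '123 Shady Lane', 25000),
        ('Shady Acres', '126 Shady Lane', 15000);
    INSERT INTO HOUSEHOLD_MEMBER VALUES
        ('123 Shady Lane', 'Jane'), ('123 Shady Lane', 'Mary'),
        ('123 Shady Lane', 'Robert'), ('126 Shady Lane', 'George'),
        ('126 Shady Lane', 'Tom'), ('126 Shady Lane', 'Betsy'),
        ('126 Shady Lane', 'Timmy');
    """
)
# The derived table M has one row per household, so the outer join
# no longer duplicates HOUSEHOLD_INCOME for every member.
result = conn.execute(
    """
    SELECT N.NEIGHBORHOOD_NAME,
           COUNT(*)                AS household_count,
           SUM(H.HOUSEHOLD_INCOME) AS total_income,
           SUM(M.member_count)     AS household_members
    FROM NEIGHBORHOOD N
    JOIN HOUSEHOLD H ON H.NEIGHBORHOOD_NAME = N.NEIGHBORHOOD_NAME
    JOIN (SELECT HOUSEHOLD_NAME, COUNT(*) AS member_count
          FROM HOUSEHOLD_MEMBER
          GROUP BY HOUSEHOLD_NAME) M
      ON M.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
    GROUP BY N.NEIGHBORHOOD_NAME
    """
).fetchall()
print(result)
```

Because the inner join drops households that have no members, a LEFT JOIN with a default of 0 for member_count would be needed if such households should still be counted.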

4 answers:

Answer 0: (score: 0)

I agree: if you want to use OLAP functions, it isn't possible without a subquery.

You can use a correlated sub-select, but it is inelegant, inefficient, and probably not what you want.

WITH NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS (VALUES ('Shady Acres'))
,   HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME)
AS (VALUES 
   ('Shady Acres', '123 Shady Lane', 25000)
  ,('Shady Acres', '126 Shady Lane', 15000) 
  )
, HOUSEHOLD_MEMBER ( HOUSEHOLD_NAME, PERSON_NAME )
AS(VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
SELECT
    NEIGHBORHOOD_NAME
,   COUNT(DISTINCT HOUSEHOLD_NAME  )  AS HOUSEHOLD_COUNT
--,   SUM(DISTINCT   HOUSEHOLD_INCOME)  AS TOTAL_INCOME       -- not valid if two households have the same income
--,   SUM(HOUSEHOLD_INCOME) OVER (PARTITION BY HOUSEHOLD_NAME)  -- not valid unless we GROUP BY HOUSEHOLD_NAME in the main body
,   SUM( (SELECT SUM(S.HOUSEHOLD_INCOME) FROM HOUSEHOLD S 
            WHERE S.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME
            AND   M.PERSON_NAME = (SELECT MAX(SS.PERSON_NAME) 
                                   FROM HOUSEHOLD_MEMBER SS
                                   WHERE SS.HOUSEHOLD_NAME = H.HOUSEHOLD_NAME))
           )            AS TOTAL_INCOME
,   COUNT(1)                          AS HOUSEHOLD_MEMBERS
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD    H     USING ( NEIGHBORHOOD_NAME )
JOIN HOUSEHOLD_MEMBER M USING ( HOUSEHOLD_NAME )
GROUP BY N.NEIGHBORHOOD_NAME

Returns

 NEIGHBORHOOD_NAME  HOUSEHOLD_COUNT     TOTAL_INCOME    HOUSEHOLD_MEMBERS
 -----------------  ---------------     ------------    -----------------
 Shady Acres                      2     40000           7

Answer 1: (score: 0)

Another option could be

with base (NEIGHBORHOOD_NAME,HOUSEHOLD_NAME, HOUSEHOLD_INCOME, HOUSEHOLD_MEMBER) as (
 values ('Shady Acres', '123 Shady Lane', 25000, 'Jane')
,( 'Shady Acres', '123 Shady Lane', 25000, 'Mary')
,( 'Shady Acres', '123 Shady Lane', 25000, 'Robert')
,( 'Shady Acres', '126 Shady Lane', 15000, 'George')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Tom')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Betsy')
,( 'Shady Acres', '126 Shady Lane', 15000, 'Timmy')
)
, temp as (
select NEIGHBORHOOD_NAME,HOUSEHOLD_NAME, HOUSEHOLD_INCOME, HOUSEHOLD_MEMBER
     , row_number() over (partition by NEIGHBORHOOD_NAME,HOUSEHOLD_NAME order by HOUSEHOLD_MEMBER) as rownum_asc
     , row_number() over (partition by NEIGHBORHOOD_NAME,HOUSEHOLD_NAME order by HOUSEHOLD_MEMBER desc) as rownum_desc
FROM base
)
SELECT NEIGHBORHOOD_NAME, sum(HOUSEHOLD_INCOME) as TOTAL_INCOME, sum(rownum_desc) as member_count

  FROM temp
 WHERE rownum_asc = 1
 GROUP BY NEIGHBORHOOD_NAME

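Numbering the rows in both directions is a small trick. The filter rownum_asc = 1 keeps exactly one row per household, and on that row rownum_desc equals the number of members in the household, so the two sums at the end do the job.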

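OLAP functions do not reduce the number of rows. That is the major difference from GROUP BY with column functions, and it is often an advantage. But because your base tables are flattened, you need a GROUP BY.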
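The difference in row counts can be seen in a small sketch. It again uses SQLite through Python's sqlite3 module rather than DB2, which assumes a bundled SQLite of version 3.25 or later for window function support; the table name flat_view and its lowercase columns are hypothetical.

```python
import sqlite3

# SQLite in memory stands in for DB2 here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flat_view (household_name TEXT, household_income INTEGER,"
    " person_name TEXT)"
)
conn.executemany(
    "INSERT INTO flat_view VALUES (?, ?, ?)",
    [
        ("123 Shady Lane", 25000, "Jane"),
        ("123 Shady Lane", 25000, "Mary"),
        ("123 Shady Lane", 25000, "Robert"),
        ("126 Shady Lane", 15000, "George"),
        ("126 Shady Lane", 15000, "Tom"),
        ("126 Shady Lane", 15000, "Betsy"),
        ("126 Shady Lane", 15000, "Timmy"),
    ],
)
# Window (OLAP) aggregate: every input row survives and carries the
# partition total alongside it.
window_rows = conn.execute(
    """
    SELECT household_name,
           SUM(household_income) OVER (PARTITION BY household_name)
    FROM flat_view
    """
).fetchall()
# GROUP BY aggregate: the rows collapse to one per household.
grouped_rows = conn.execute(
    """
    SELECT household_name, SUM(household_income)
    FROM flat_view
    GROUP BY household_name
    """
).fetchall()
print(len(window_rows), len(grouped_rows))
```

The window query returns all 7 rows while the grouped query returns 2. In both cases the sum still counts the duplicated income once per member (75000 and 60000), which is why a window function alone does not remove the duplication in a flattened view.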

Answer 2: (score: 0)

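So this would be a very hacky solution. It works as long as there are no hash collisions on the key of the denormalized/duplicated data, and as long as there are not so many duplicates that the HASH values, which I pushed down into the decimal places, affect the SUM().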

WITH NEIGHBORHOOD(NEIGHBORHOOD_NAME) AS (VALUES ('Shady Acres'))
,   HOUSEHOLD (NEIGHBORHOOD_NAME, HOUSEHOLD_NAME, HOUSEHOLD_INCOME)
AS (VALUES 
   ('Shady Acres', '123 Shady Lane', 25000)
  ,('Shady Acres', '126 Shady Lane', 25000) 
  )
, HOUSEHOLD_MEMBER ( HOUSEHOLD_NAME, PERSON_NAME )
AS(VALUES
      ('123 Shady Lane', 'Jane'  )
     ,('123 Shady Lane', 'Mary'  )
     ,('123 Shady Lane', 'Robert')
     ,('126 Shady Lane', 'George')
     ,('126 Shady Lane', 'Tom'   )
     ,('126 Shady Lane', 'Betsy' )
     ,('126 Shady Lane', 'Timmy' )    
)
SELECT
    NEIGHBORHOOD_NAME
,   COUNT(DISTINCT HOUSEHOLD_NAME  )  AS HOUSEHOLD_COUNT
,   BIGINT(SUM(DISTINCT DECFLOAT(HOUSEHOLD_INCOME || '.000000' || ABS(HASH4(HOUSEHOLD_NAME))))) AS TOTAL_INCOME
,   COUNT(1)                          AS HOUSEHOLD_MEMBERS
FROM NEIGHBORHOOD N
JOIN HOUSEHOLD    H     USING ( NEIGHBORHOOD_NAME )
JOIN HOUSEHOLD_MEMBER M USING ( HOUSEHOLD_NAME )
GROUP BY N.NEIGHBORHOOD_NAME

Returns

 NEIGHBORHOOD_NAME  HOUSEHOLD_COUNT     TOTAL_INCOME    HOUSEHOLD_MEMBERS
 -----------------  ---------------     ------------    -----------------
 Shady Acres                      2        50000                    7

Note that I gave both households the same income to show that the solution works in that case.

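I guess you could argue that SQL is missing some syntax here: an OLAP function with the DISTINCT keyword should let you specify what defines distinctness separately from what is being aggregated.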

Answer 3: (score: 0)

The following query returns the result you need:

l =[e for i in status for e in i]