Deduplication using aggregate calculation checks

Time: 2018-03-09 17:22:20

Tags: sql amazon-redshift

We have a dataset in which users (distinct) have facilities (multiple), which contain accounts (multiple), which in turn have holdings (multiple).

I am running into cases of duplication, for example:

    user_ID='A'
    facility_ID='1' account_ID in ('A','B')
    facility_ID='2' account_ID in ('C','D')

where count(accounts), sum(holdings amount), and the holdings_amount values are the same for two facilities of one user:

    user_id  facility_id  facility_name  account_id  holdings_amount
    A        1            Fidelity       A           100
    A        1            Fidelity       A           200
    A        1            Fidelity       B           300
    A        1            Fidelity       B           400
    A        2            Fidelity       C           200
    A        2            Fidelity       C           100
    A        2            Fidelity       D           400
    A        2            Fidelity       D           300
    A        3            Fidelity       E           100
    A        3            Fidelity       E           200
    A        3            Fidelity       F           700
    A        4            Fidelity       G           200
    A        4            Fidelity       G           100
    A        4            Fidelity       H           400
    A        4            Fidelity       H           300

SQL Fiddle with the appropriate data: http://sqlfiddle.com/#!15/697f6/1

At the user level, what I want to do is:

  1. IF count(facilities) > 1 (note that it can be > 2)
  2. AND facility_name from one facility = facility_name from another
  3. AND count(accounts) from one facility = count(accounts) from another
  4. AND count(holdings_amount) from one account = count(holdings_amount) from another
  5. AND sum(holdings_amount) from one account = sum(holdings_amount) from another
  6. AND every holdings amount value from one account equals a holdings amount value from another (in any order)
  7. THEN exclude the duplicate facility (i.e. the accounts associated with it).

So the expected output is:

    user_id  facility_id  facility_name  account_id  holdings_amount
    A        1            Fidelity       A           100
    A        1            Fidelity       A           200
    A        1            Fidelity       B           300
    A        1            Fidelity       B           400
    A        3            Fidelity       E           100
    A        3            Fidelity       E           200
    A        3            Fidelity       F           700
    A        4            Fidelity       G           200
    A        4            Fidelity       G           100
    A        4            Fidelity       H           400
    A        4            Fidelity       H           300

Facility 2 is excluded because it meets all 6 conditions; facility 3 does not meet condition 4, and facility 4 does not meet condition 6.

Please let me know if anything is unclear or if I can provide more details. Thanks!
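To pin down what "duplicate" means under conditions 1 to 6, here is a minimal sketch in Python rather than SQL. It is not taken from either answer. The function names (`facility_signature`, `dedupe_facilities`) and the sample rows are hypothetical, and it assumes the lowest facility_id in each group of duplicates is the one to keep. Each facility is reduced to an order-independent signature: the sorted holdings of each account, with the accounts themselves sorted. Two facilities of the same user and facility_name with equal signatures necessarily agree on account count, per-account holdings count, per-account sum, and individual values.

```python
from collections import defaultdict


def facility_signature(account_rows):
    """Order-independent signature for one facility.

    account_rows: iterable of (account_id, holdings_amount) pairs.
    Account ids are ignored; only the shape of the holdings matters.
    """
    per_account = defaultdict(list)
    for account_id, amount in account_rows:
        per_account[account_id].append(amount)
    # Sort amounts inside each account, then sort the accounts themselves.
    return tuple(sorted(tuple(sorted(amounts)) for amounts in per_account.values()))


def dedupe_facilities(rows):
    """Return the facility ids to keep.

    rows: (user_id, facility_id, facility_name, account_id, holdings_amount).
    Assumption: within a group of duplicates, the lowest facility_id is kept.
    """
    by_facility = defaultdict(list)
    for user_id, facility_id, facility_name, account_id, amount in rows:
        by_facility[(user_id, facility_name, facility_id)].append((account_id, amount))

    seen = set()
    kept = []
    for key in sorted(by_facility):  # lowest facility_id first within each user/name
        user_id, facility_name, facility_id = key
        signature = (user_id, facility_name, facility_signature(by_facility[key]))
        if signature not in seen:
            seen.add(signature)
            kept.append(facility_id)
    return kept


# Hypothetical sample data (not the data from the question).
ROWS = [
    ("U", 10, "Acme", "P", 100), ("U", 10, "Acme", "P", 200), ("U", 10, "Acme", "Q", 300),
    ("U", 11, "Acme", "R", 200), ("U", 11, "Acme", "R", 100), ("U", 11, "Acme", "S", 300),
    ("U", 12, "Acme", "T", 100), ("U", 12, "Acme", "T", 200), ("U", 12, "Acme", "T", 300),
]

print(dedupe_facilities(ROWS))  # [10, 12]
```

In this sample, facility 11 repeats facility 10 with its holdings listed in a different order, so it is dropped. Facility 12 has the same total (600) and the same number of holdings, but they sit in a single account, so totals and counts at the facility level are not enough on their own to call it a duplicate.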

2 answers:

Answer 0: (score: 1)

Here is my idea, although it doesn't seem to return any results in your fiddle.

select
    a2.id,
    count(h1.id), count(h2.id), count(distinct a1.id), count(distinct a2.id)
from
    (
        facilities f1
        inner join accounts a1 on a1.facility_id = f1.id
        inner join holdings h1 on h1.acc_id = a1.id
    )
    full outer join
    (
        facilities f2
        inner join accounts a2 on a2.facility_id = f2.id
        inner join holdings h2 on h2.acc_id = a2.id)
    on      f2.id <> f1.id
        and a2.id > a1.id
        and f2.facility_name = f1.facility_name
        and h2.holdings_amount = h1.holdings_amount
group by a2.id
having
        count(h1.id) = count(h2.id)
    and count(distinct a1.id) = count(distinct a2.id)
    and sum(h1.holdings_amount) = sum(h2.holdings_amount)
    and count(h1.id) = count(*) and count(h2.id) = count(*);

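Looking back, I realize that you do have constraints at multiple levels, and this query won't handle them. It may help put you on the right track, but I can think of a few problems with it.

Answer 1: (score: 0)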

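-- Per-facility aggregates: distinct accounts, distinct holdings,
-- total holdings amount, and a checksum over the holding ids.
with f_agg as (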
    select f.user_id, f.id, f.facility_name,
        count(distinct a.id)  as a_cnt,
        count(distinct h.id) as h_cnt,
        sum(h.holdings_amount) as h_tot,
        sum(cast(h.id as int)) as h_chk
    from
        facilities f
        inner join accounts a on a.facility_id = f.id
        inner join holdings h on h.acc_id = a.id
    group by f.user_id, f.id, f.facility_name
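), potential as (
    -- Candidate pairs: two facilities of the same user and name
    -- whose account count, holdings count, and total all agree.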
    select fa1.id as id1, fa2.id as id2
    from f_agg as fa1 cross join f_agg as fa2
    where fa2.id > fa1.id
            and fa2.user_id = fa1.user_id
            and fa2.facility_name = fa1.facility_name
            and fa2.a_cnt = fa1.a_cnt
            and fa2.h_cnt = fa1.h_cnt
            and fa2.h_tot = fa1.h_tot
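),
matches as (
    -- Confirm each candidate pair: every holding on one side must pair with
    -- a holding of equal amount and equal row number on the other side.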
    select coalesce(p1.id1, p2.id1) as id1, coalesce(p1.id2, p2.id2) as id2
    from
        (
        potential p1
        inner join f_agg fa1 on fa1.id = p1.id1
        inner join accounts a1 on a1.facility_id = fa1.id
        inner join
            (
            select *, row_number() over (partition by acc_id order by id) as rn
            from holdings
            ) h1 on h1.acc_id = a1.id
        )
        full outer join
        (
        potential p2
        inner join f_agg fa2 on fa2.id = p2.id2
        inner join accounts a2 on a2.facility_id = fa2.id  
        inner join 
            (
            select *, row_number() over (partition by acc_id order by id) as rn
            from holdings
            ) h2 on h2.acc_id = a2.id
        )
        on      p2.id1 = p1.id1 and p2.id2 = p1.id2
            and h2.rn = h1.rn and h2.holdings_amount = h1.holdings_amount
    group by coalesce(p1.id1, p2.id1), coalesce(p1.id2, p2.id2)
    having   count(h1.id) = count(*)
         and count(h2.id) = count(*)
         and sum(cast(h1.id as int)) = min(fa1.h_chk)
         and sum(cast(h2.id as int)) = min(fa2.h_chk)
)
select * from matches;

Leaving this here in case I come back to play with it some more: http://sqlfiddle.com/#!15/697f6/120