Question

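We have a dataset in which users (distinct) have facilities (multiple), which contain accounts (multiple), which in turn have holdings (multiple).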

I'm running into duplicate cases such as the following, where user_ID='A' has facility_ID='1' with account_ID in ('A','B') and facility_ID='2' with account_ID in ('C','D'), and count(accounts), sum(holdings amount) and the individual holdings_amount values are the same for the user's two facilities:

user_id  facility_id  facility_name  account_id  holdings_amount
A        1            Fidelity       A           100
A        1            Fidelity       A           200
A        1            Fidelity       B           300
A        1            Fidelity       B           400
A        2            Fidelity       C           200
A        2            Fidelity       C           100
A        2            Fidelity       D           400
A        2            Fidelity       D           300
A        3            Fidelity       E           100
A        3            Fidelity       E           200
A        3            Fidelity       F           700
A        4            Fidelity       G           200
A        4            Fidelity       G           100
A        4            Fidelity       H           400
A        4            Fidelity       H           300

SQL Fiddle with the appropriate data: http://sqlfiddle.com/#!15/697f6/1

At the user level, what I want to do is:

IF count(facilities) > 1 (note that it can be > 2)
AND facility_name is the same for both facilities
AND count(accounts) from one facility = count(accounts) from another
AND count(holdings_amount) from one account = count(holdings_amount) from another
AND sum(holdings_amount) from one account = sum(holdings_amount) from another
AND every holdings amount value from one account equals a holdings amount value from the other (in any order)

THEN exclude the duplicate facility from the count (i.e. together with the accounts associated with it).

So the expected output is:

user_id  facility_id  facility_name  account_id  holdings_amount
A        1            Fidelity       A           100
A        1            Fidelity       A           200
A        1            Fidelity       B           300
A        1            Fidelity       B           400
A        3            Fidelity       E           100
A        3            Fidelity       E           200
A        3            Fidelity       F           700
A        4            Fidelity       G           200
A        4            Fidelity       G           100
A        4            Fidelity       H           400
A        4            Fidelity       H           300

Facility 2 is excluded because it meets all 6 points, whereas facility 3 does not meet point 4 and facility 4 does not meet point 6.

Please let me know if anything is unclear or if I can provide more detail. Thanks!
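The six points can be read as one order-independent comparison: two facilities of the same user with the same name are duplicates when their accounts hold the same amounts, regardless of account ids or row order. Below is a minimal Python sketch of that reading, not taken from either answer. It assumes that equal per-account holdings imply the count and sum checks (points 3 to 5), uses only the rows for facilities 1 to 3 from the question, and the helper names `facility_signatures` and `kept_facilities` are made up for illustration.

```python
from collections import defaultdict

# Rows as (user_id, facility_id, facility_name, account_id, holdings_amount).
# Assumption: only facilities 1-3 from the question's sample are included.
ROWS = [
    ("A", 1, "Fidelity", "A", 100), ("A", 1, "Fidelity", "A", 200),
    ("A", 1, "Fidelity", "B", 300), ("A", 1, "Fidelity", "B", 400),
    ("A", 2, "Fidelity", "C", 200), ("A", 2, "Fidelity", "C", 100),
    ("A", 2, "Fidelity", "D", 400), ("A", 2, "Fidelity", "D", 300),
    ("A", 3, "Fidelity", "E", 100), ("A", 3, "Fidelity", "E", 200),
    ("A", 3, "Fidelity", "F", 700),
]


def facility_signatures(rows):
    """Map (user_id, facility_id) to an order-independent signature."""
    accounts = defaultdict(lambda: defaultdict(list))
    names = {}
    for user, fac, name, acc, amount in rows:
        accounts[(user, fac)][acc].append(amount)
        names[(user, fac)] = name
    signatures = {}
    for key, per_account in accounts.items():
        # Sort amounts within each account, then sort the accounts themselves,
        # so neither holding order nor account ids affect the comparison.
        holdings = tuple(sorted(tuple(sorted(v)) for v in per_account.values()))
        signatures[key] = (key[0], names[key], holdings)
    return signatures


def kept_facilities(rows):
    """Keep the lowest facility_id among facilities sharing a signature."""
    seen = set()
    kept = []
    for (user, fac), sig in sorted(facility_signatures(rows).items()):
        if sig not in seen:
            seen.add(sig)
            kept.append(fac)
    return kept


print(kept_facilities(ROWS))  # [1, 3]
```

Facility 2 is dropped because its signature equals that of facility 1, while facility 3 survives because account F holds a single amount.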

Answer 1

Here is my idea, although it doesn't seem to return any results in your fiddle.

select
    a2.id,
    count(h1.id), count(h2.id), count(distinct a1.id), count(distinct a2.id)
from
    (
        facilities f1
        inner join accounts a1 on a1.facility_id = f1.id
        inner join holdings h1 on h1.acc_id = a1.id
    )
    full outer join
    (
        facilities f2
        inner join accounts a2 on a2.facility_id = f2.id
        inner join holdings h2 on h2.acc_id = a2.id
    )
    on      f2.id <> f1.id
        and a2.id > a1.id
        and f2.facility_name = f1.facility_name
        and h2.holdings_amount = h1.holdings_amount
group by a2.id
having
        count(h1.id) = count(h2.id)
    and count(distinct a1.id) = count(distinct a2.id)
    and sum(h1.holdings_amount) = sum(h2.holdings_amount)
    and count(h1.id) = count(*) and count(h2.id) = count(*);

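Looking back, I realize that you have constraints at more than one level, and this query does not handle that. It may help get you on the right track, but I can think of a few problems with it.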
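To illustrate the multi-level constraint, here is a hedged sketch that computes the account-level aggregates (points 4 and 5 of the question) for facilities 1 and 3. It runs the SQL through Python's `sqlite3` module with an in-memory database standing in for the fiddle. The table and column names follow the query above; the column types and the helper `account_profile` are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, inferred from the joins used in the answers.
conn.executescript("""
    create table facilities (id integer primary key, user_id text, facility_name text);
    create table accounts   (id text primary key, facility_id integer);
    create table holdings   (id integer primary key, acc_id text, holdings_amount integer);
    insert into facilities values (1, 'A', 'Fidelity'), (3, 'A', 'Fidelity');
    insert into accounts values ('A', 1), ('B', 1), ('E', 3), ('F', 3);
    insert into holdings (acc_id, holdings_amount) values
        ('A', 100), ('A', 200), ('B', 300), ('B', 400),
        ('E', 100), ('E', 200), ('F', 700);
""")

# One row per account: how many holdings it has and what they add up to.
ACCOUNT_AGG = """
    select a.facility_id, count(h.id) as h_cnt, sum(h.holdings_amount) as h_tot
    from accounts a
    inner join holdings h on h.acc_id = a.id
    group by a.facility_id, a.id
"""


def account_profile(facility_id):
    """Sorted (h_cnt, h_tot) pairs for the accounts of one facility."""
    rows = conn.execute(ACCOUNT_AGG).fetchall()
    return sorted((cnt, tot) for fac, cnt, tot in rows if fac == facility_id)


print(account_profile(1))  # [(2, 300), (2, 700)]
print(account_profile(3))  # [(1, 700), (2, 300)]
```

Both facilities have two accounts and a total of 1000, so a facility-level check alone would pair them; the per-account holding counts are what tell them apart.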

Answer 2

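-- f_agg: one row per facility with its account count, holding count,
-- holdings total, and a checksum built from the holding ids
with f_agg as (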
    select f.user_id, f.id, f.facility_name,
        count(distinct a.id)  as a_cnt,
        count(distinct h.id) as h_cnt,
        sum(h.holdings_amount) as h_tot,
        sum(cast(h.id as int)) as h_chk
    from
        facilities f
        inner join accounts a on a.facility_id = f.id
        inner join holdings h on h.acc_id = a.id
    group by f.user_id, f.id, f.facility_name
-- potential: pairs of facilities (same user, same name) whose aggregates agree
), potential as (
    select fa1.id as id1, fa2.id as id2
    from f_agg as fa1 cross join f_agg as fa2
    where fa2.id > fa1.id
            and fa2.user_id = fa1.user_id
            and fa2.facility_name = fa1.facility_name
            and fa2.a_cnt = fa1.a_cnt
            and fa2.h_cnt = fa1.h_cnt
            and fa2.h_tot = fa1.h_tot
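),
-- matches: candidate pairs in which every holding on each side finds a partner
-- with the same per-account row number and the same amount
matches as (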
    select coalesce(p1.id1, p2.id1) as id1, coalesce(p1.id2, p2.id2) as id2
    from
        (
        potential p1
        inner join f_agg fa1 on fa1.id = p1.id1
        inner join accounts a1 on a1.facility_id = fa1.id
        inner join
            (
            select *, row_number() over (partition by acc_id order by id) as rn
            from holdings
            ) h1 on h1.acc_id = a1.id
        )
        full outer join
        (
        potential p2
        inner join f_agg fa2 on fa2.id = p2.id2
        inner join accounts a2 on a2.facility_id = fa2.id  
        inner join 
            (
            select *, row_number() over (partition by acc_id order by id) as rn
            from holdings
            ) h2 on h2.acc_id = a2.id
        )
        on      p2.id1 = p1.id1 and p2.id2 = p1.id2
            and h2.rn = h1.rn and h2.holdings_amount = h1.holdings_amount
    group by coalesce(p1.id1, p2.id1), coalesce(p1.id2, p2.id2)
    having   count(h1.id) = count(*)
         and count(h2.id) = count(*)
         and sum(cast(h1.id as int)) = min(fa1.h_chk)
         and sum(cast(h2.id as int)) = min(fa2.h_chk)
)
select * from matches;

Leaving this here in case I come back to play with it some more: http://sqlfiddle.com/#!15/697f6/120
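As a rough illustration of the first two CTEs (`f_agg` and `potential`), the sketch below runs them against the rows for facilities 1 to 3 from the question. It uses Python's `sqlite3` module with an in-memory database in place of the fiddle, an assumed schema inferred from the joins, and leaves out the `h_chk` checksum. The pairs it returns are only candidates; the `matches` step is still needed to compare the individual holdings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema, inferred from the joins used in the answer.
conn.executescript("""
    create table facilities (id integer primary key, user_id text, facility_name text);
    create table accounts   (id text primary key, facility_id integer);
    create table holdings   (id integer primary key, acc_id text, holdings_amount integer);
    insert into facilities values
        (1, 'A', 'Fidelity'), (2, 'A', 'Fidelity'), (3, 'A', 'Fidelity');
    insert into accounts values
        ('A', 1), ('B', 1), ('C', 2), ('D', 2), ('E', 3), ('F', 3);
    insert into holdings (acc_id, holdings_amount) values
        ('A', 100), ('A', 200), ('B', 300), ('B', 400),
        ('C', 200), ('C', 100), ('D', 400), ('D', 300),
        ('E', 100), ('E', 200), ('F', 700);
""")

# Facility-level aggregates, then pairs of facilities whose aggregates agree.
POTENTIAL = """
    with f_agg as (
        select f.user_id, f.id, f.facility_name,
            count(distinct a.id)   as a_cnt,
            count(distinct h.id)   as h_cnt,
            sum(h.holdings_amount) as h_tot
        from facilities f
            inner join accounts a on a.facility_id = f.id
            inner join holdings h on h.acc_id = a.id
        group by f.user_id, f.id, f.facility_name
    )
    select fa1.id as id1, fa2.id as id2
    from f_agg as fa1 cross join f_agg as fa2
    where fa2.id > fa1.id
        and fa2.user_id = fa1.user_id
        and fa2.facility_name = fa1.facility_name
        and fa2.a_cnt = fa1.a_cnt
        and fa2.h_cnt = fa1.h_cnt
        and fa2.h_tot = fa1.h_tot
    order by id1, id2
"""

pairs = conn.execute(POTENTIAL).fetchall()
print(pairs)  # [(1, 2)]
```

Facility 3 drops out at this stage because it has three holdings rather than four, even though its total is also 1000.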

Deduplication using aggregate calculation checks
