Joining two datasets by intersection

Time: 2019-02-01 09:34:07

Tags: sql oracle intersection

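I posted a version of this question before, but I have been struggling to find an answer for this slightly different data format, so I am asking again.

I have the following dataset. (Read the data like this: ID 1, Ford, has the attribute:value pairs A:B, B:C and C:D.)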

+------------------------------------------------+
| ID     NAME     Attribute      Attribute Value |
+------------------------------------------------+
| 1      Ford         A                  B       |
| 1      Ford         B                  C       |
| 1      Ford         C                  D       |
| 2      BMW          A                  B       |
| 2      BMW          C                  D       |
| 2      BMW          F                  G       |
| 3      TESLA        Z                  Y       |
| 3      TESLA        E                  F       |
| 3      TESLA        A                  B       |
+------------------------------------------------+

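I basically want to compare every ID in the table against every other ID and output the result. The first comparison checks ID 1 against IDs 2 and 3 and shows which attribute:value pairs match and which do not.

Output (only the first comparison is shown, against a single record):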

+----------------------------------------------------------------------------+
| BaseID  BaseNAME   Target ID   TargetName    MatchedOn    Baseonly Tgtonly |
+----------------------------------------------------------------------------+
| 1        Ford         2          BMW           A:B;C:D     B:C     F:G     |
+----------------------------------------------------------------------------+

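Previously a kind person helped me implement this with a Cartesian product, on data in a slightly different format, but it was too slow. So I would like to know whether anyone has ideas about the best way to get the expected result.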

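2 answers:

Answer 0: (score: 1)

Works in Oracle 12+ (the function declared in the WITH clause requires 12c or later).

In 11g you can use listagg or a UDF to concatenate the collection elements.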

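with
-- concatenate the elements of a collection into one '; '-separated string
function collagg(p in sys.ku$_vcnt) return varchar2 is
  result varchar2(4000);
begin
  for i in 1..p.count loop result := result || '; ' || p(i); end loop;
  return substr(result, 3);  -- skip the leading '; ' (two characters)
end;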
t(id, name, attr, val) as
( select 1, 'Ford',  'A', 'B' from dual union all
  select 1, 'Ford',  'B', 'C' from dual union all
  select 1, 'Ford',  'C', 'D' from dual union all
  select 2, 'BMW',   'A', 'B' from dual union all
  select 2, 'BMW',   'C', 'D' from dual union all
  select 2, 'BMW',   'F', 'G' from dual union all
  select 3, 'TESLA', 'Z', 'Y' from dual union all
  select 3, 'TESLA', 'E', 'F' from dual union all
  select 3, 'TESLA', 'A', 'B' from dual)
, t0 as
(select id, name, 
        cast(collect(cast(attr||':'||val as varchar2(4000))) as sys.ku$_vcnt) c
   from t t1
  group by id, name)
select t1.id baseid,
       t1.name basename,
       t2.id tgtid,
       t2.name tgtname,
       collagg(t1.c multiset intersect t2.c) matchedon,
       collagg(t1.c multiset except t2.c) baseonly,
       collagg(t2.c multiset except t1.c) tgtonly
  from t0 t1 join t0 t2 on t1.id < t2.id;
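
For 11g, where a function cannot be declared in the WITH clause, the same collections can be flattened with listagg instead of collagg. The following is a minimal sketch of that idea, assuming 11gR2 (the release that introduced listagg). It has not been tuned, and the inline sample rows only stand in for the real table.

```sql
with t as (
  -- sample rows from the question
  select 1 id, 'Ford' name, 'A' attr, 'B' val from dual union all
  select 1, 'Ford',  'B', 'C' from dual union all
  select 1, 'Ford',  'C', 'D' from dual union all
  select 2, 'BMW',   'A', 'B' from dual union all
  select 2, 'BMW',   'C', 'D' from dual union all
  select 2, 'BMW',   'F', 'G' from dual union all
  select 3, 'TESLA', 'Z', 'Y' from dual union all
  select 3, 'TESLA', 'E', 'F' from dual union all
  select 3, 'TESLA', 'A', 'B' from dual),
t0 as (
  -- one nested table of 'attr:val' strings per car
  select id, name,
         cast(collect(cast(attr || ':' || val as varchar2(4000))) as sys.ku$_vcnt) c
    from t
   group by id, name)
select t1.id baseid, t1.name basename, t2.id tgtid, t2.name tgtname,
       -- table() unnests the result of each multiset operation so listagg can join it
       (select listagg(column_value, ';') within group (order by column_value)
          from table(t1.c multiset intersect t2.c)) matchedon,
       (select listagg(column_value, ';') within group (order by column_value)
          from table(t1.c multiset except t2.c)) baseonly,
       (select listagg(column_value, ';') within group (order by column_value)
          from table(t2.c multiset except t1.c)) tgtonly
  from t0 t1 join t0 t2 on t1.id < t2.id
 order by t1.id, t2.id;
```

The multiset operators need a nested table type such as sys.ku$_vcnt, which is why the sketch keeps that type from the query above.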

Answer 1: (score: 0)

This might be faster:

with 
  t1 as (select distinct a.id ia, a.name na, b.id ib, b.name nb 
           from t a join t b on a.id < b.id),
  t2 as (
    select ia, na, ib, nb, 
           cast(multiset(select attr||':'||val from t where id = ia intersect 
                         select attr||':'||val from t where id = ib ) 
                as sys.odcivarchar2list) a1, 
           cast(multiset(select attr||':'||val from t where id = ia minus 
                         select attr||':'||val from t where id = ib ) 
                as sys.odcivarchar2list) a2, 
           cast(multiset(select attr||':'||val from t where id = ib minus 
                         select attr||':'||val from t where id = ia ) 
                as sys.odcivarchar2list) a3 
      from t1)
select ia, na, ib, nb, 
       (select listagg(column_value, ';') within group (order by column_value) from table(t2.a1)) l1,
       (select listagg(column_value, ';') within group (order by column_value) from table(t2.a2)) l2,
       (select listagg(column_value, ';') within group (order by column_value) from table(t2.a3)) l3
  from t2
  order by ia, ib

dbfiddle demo

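  • Subquery t1 builds the pairs of "cars" to be compared.
  • t2 collects, for each pair, the sets of common and differing attributes. sys.odcivarchar2list is a built-in collection type (a varray of strings).
  • The final query turns the collections into delimited strings. Result: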

    IA NA            IB NB    L1        L2           L3
    -- ------------ --- ----- --------- ------------ -----------
     1 Ford           2 BMW   A:B;C:D   B:C          F:G
     1 Ford           3 TESLA A:B       B:C;C:D      E:F;Z:Y
     2 BMW            3 TESLA A:B       C:D;F:G      E:F;Z:Y
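
To see what such a collection looks like on its own, here is a small sketch that builds a sys.odcivarchar2list from literals, unnests it into rows, and then aggregates it back into one string. The literal values are made up for illustration.

```sql
-- unnest a collection into rows; each element is exposed as COLUMN_VALUE
select column_value
  from table(sys.odcivarchar2list('C:D', 'A:B'));

-- aggregate the same elements back into one delimited string
-- expected value of JOINED: A:B;C:D
select listagg(column_value, ';') within group (order by column_value) as joined
  from table(sys.odcivarchar2list('C:D', 'A:B'));
```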
    

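I expect it to be faster because it uses no user-defined functions and keeps the number of operations to a minimum.

An alternative is to use a function like this: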

-- find different or common attributes
create or replace function dca(i1 in number, i2 in number, op in char) 
  return varchar2 is 
  ret varchar2(4000);  -- listagg can return up to 4000 bytes
begin 
  case op 
    when 'M' then -- minus
      select listagg(attr||':'||val, ';') within group (order by null) into ret
        from (select attr, val from t where id = i1 minus 
              select attr, val from t where id = i2 );
    when 'I' then -- intersect
      select listagg(attr||':'||val, ';') within group (order by null) into ret
        from (select attr, val from t where id = i1 intersect 
              select attr, val from t where id = i2 );
  end case;
  return ret;
end;

Used in this query:

select ia, na, ib, nb, 
       dca(ia, ib, 'I') ab, dca(ia, ib, 'M') a_b, dca(ib, ia, 'M') b_a 
  from (select distinct a.id ia, a.name na, b.id ib, b.name nb 
          from t a join t b on a.id < b.id)
  order by ia, ib;

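It works too, but it is a UDF, and UDFs called from a query tend to perform worse: each call incurs a switch between the SQL and PL/SQL engines, and here each call also runs its own query.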
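
A formulation that stays entirely in SQL and uses neither collections nor a UDF is also possible. The sketch below is not taken from either answer: it swaps in conditional aggregation, flagging each attribute:value pair by the side it occurs on and then aggregating with listagg. The names pairs, tagged, av, in_base and in_tgt are hypothetical, and whether this is actually faster on real data would have to be measured.

```sql
with t as (
  -- sample rows from the question
  select 1 id, 'Ford' name, 'A' attr, 'B' val from dual union all
  select 1, 'Ford',  'B', 'C' from dual union all
  select 1, 'Ford',  'C', 'D' from dual union all
  select 2, 'BMW',   'A', 'B' from dual union all
  select 2, 'BMW',   'C', 'D' from dual union all
  select 2, 'BMW',   'F', 'G' from dual union all
  select 3, 'TESLA', 'Z', 'Y' from dual union all
  select 3, 'TESLA', 'E', 'F' from dual union all
  select 3, 'TESLA', 'A', 'B' from dual),
pairs as (
  -- every pair of cars with base id < target id (hypothetical CTE name)
  select distinct a.id baseid, a.name basename, b.id tgtid, b.name tgtname
    from t a join t b on a.id < b.id),
tagged as (
  -- one row per pair and attribute/value, flagged by the side(s) it occurs on
  select p.baseid, p.basename, p.tgtid, p.tgtname,
         x.attr || ':' || x.val av,
         max(case when x.id = p.baseid then 1 else 0 end) in_base,
         max(case when x.id = p.tgtid  then 1 else 0 end) in_tgt
    from pairs p join t x on x.id in (p.baseid, p.tgtid)
   group by p.baseid, p.basename, p.tgtid, p.tgtname, x.attr, x.val)
select baseid, basename, tgtid, tgtname,
       -- listagg skips the nulls produced by non-matching case branches
       listagg(case when in_base = 1 and in_tgt = 1 then av end, ';')
         within group (order by av) matchedon,
       listagg(case when in_base = 1 and in_tgt = 0 then av end, ';')
         within group (order by av) baseonly,
       listagg(case when in_base = 0 and in_tgt = 1 then av end, ';')
         within group (order by av) tgtonly
  from tagged
 group by baseid, basename, tgtid, tgtname
 order by baseid, tgtid;
```

On the sample data the row for Ford against BMW should come out as A:B;C:D, B:C and F:G, matching the output requested in the question.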