How to replicate a SAS merge

Date: 2015-08-27 17:43:03

Tags: sql merge sas outer-join

I have two tables, t1 and t2:

t1
  person | visit | code1 | type1
       1       1      50      50 
       1       1      50      50 
       1       2      75      50 

t2
  person | visit | code2 | type2
       1       1      50      50 
       1       1      50      50 
       1       1      50      50 

When SAS runs the following code:

   DATA t3;
     MERGE t1 t2;
     BY person visit;

   RUN;

it produces the following dataset:

       person | visit | code1 | type1 | code2 | type2
            1       1      50      50      50      50
            1       1      50      50      50      50
            1       1      50      50      50      50
            1       2      75      50
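To make the target behaviour concrete, here is a rough Python sketch of what the match-merge does with this data. It is a simplified emulation, not SAS itself: the function name sas_merge is made up for illustration, and it assumes both inputs are already sorted by the BY columns and share no columns other than the BY columns. Rows are paired by position within each person/visit group, and once one side runs out its last row is reused.

```python
from itertools import groupby


def sas_merge(left, right, by):
    """Simplified emulation of a SAS match-merge (MERGE ... BY ...).

    Assumes `left` and `right` are lists of dicts sorted by the BY columns.
    """
    def key(row):
        return tuple(row[c] for c in by)

    # Output columns in first-seen order: t1's columns, then t2's new ones.
    columns = list(dict.fromkeys(c for row in left + right for c in row))
    left_groups = {k: list(g) for k, g in groupby(left, key)}
    right_groups = {k: list(g) for k, g in groupby(right, key)}

    merged = []
    for k in sorted(set(left_groups) | set(right_groups)):
        l_rows = left_groups.get(k, [])
        r_rows = right_groups.get(k, [])
        for i in range(max(len(l_rows), len(r_rows))):
            out = dict.fromkeys(columns)  # every column starts as missing (None)
            # Pair rows by position; when a side runs out, reuse its last row.
            if l_rows:
                out.update(l_rows[min(i, len(l_rows) - 1)])
            if r_rows:
                out.update(r_rows[min(i, len(r_rows) - 1)])
            merged.append(out)
    return merged


t1 = [
    {"person": 1, "visit": 1, "code1": 50, "type1": 50},
    {"person": 1, "visit": 1, "code1": 50, "type1": 50},
    {"person": 1, "visit": 2, "code1": 75, "type1": 50},
]
t2 = [
    {"person": 1, "visit": 1, "code2": 50, "type2": 50},
    {"person": 1, "visit": 1, "code2": 50, "type2": 50},
    {"person": 1, "visit": 1, "code2": 50, "type2": 50},
]

t3 = sas_merge(t1, t2, by=("person", "visit"))
for row in t3:
    print(row)
```

The sketch yields four rows: three for visit 1 (the second t1 row is reused for the third t2 row) and one for visit 2 with code2 and type2 missing.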

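I want to replicate this process in SQL, and my idea was to use a full outer join. This works unless there are duplicate rows. When there are duplicate rows, as in the example above, the full outer join produces the following table: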

       person | visit | code1 | type1 | code2 | type2
            1       1      50      50      50      50
            1       1      50      50      50      50
            1       1      50      50      50      50
            1       1      50      50      50      50
            1       1      50      50      50      50
            1       1      50      50      50      50
            1       2      75      50
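The six rows for visit 1 appear because a join on person and visit alone pairs every t1 row with every t2 row that shares the key. A tiny Python sketch of that pairing, using only the key columns from the tables above (the variable names are illustrative):

```python
# (person, visit) keys of each row, in table order.
t1_keys = [(1, 1), (1, 1), (1, 2)]
t2_keys = [(1, 1), (1, 1), (1, 1)]

# An equi-join keeps every (t1 row, t2 row) combination with equal keys.
pairs = [(a, b) for a in t1_keys for b in t2_keys if a == b]

# The outer part of the join adds back t1 rows that found no partner.
unmatched_left = [a for a in t1_keys if a not in t2_keys]

print(len(pairs))       # 2 t1 rows x 3 t2 rows = 6
print(unmatched_left)   # [(1, 2)]
```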

I would like to know how to make the SQL table match the SAS table.

2 answers:

Answer 0: (score: 2)

Gordon's answer is close, but it misses one point. Here is its output:

person  visit   code1   type1   seqnum  person  visit   code2   type2   seqnum
1       1       1       1       1       1       1       1       1       1
1       1       2       2       2       1       1       2       2       2
NULL    NULL    NULL    NULL    NULL    1       1       3       3       3
1       2       1       3       1       NULL    NULL    NULL    NULL    NULL

The NULLs in the third row are incorrect, while the NULLs in the fourth row are correct.

As far as I know, there is no way to do this in SQL other than splitting things into several queries. I think there are five possibilities:

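  • Matching person/visit, matching seqnums
  • Matching person/visit, left has more seqnums
  • Matching person/visit, right has more seqnums
  • Left has an unmatched person/visit
  • Right has an unmatched person/visit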

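I think the last two might be workable in a single query, but I think the second and third have to be separate queries. You can, of course, union everything together.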
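As a rough illustration of how these five cases divide up the output rows, the following Python sketch counts how many result rows each case contributes for a given pair of tables. The function name case_breakdown and the case labels are made up for this example; only the (person, visit) keys of each row matter here.

```python
from collections import Counter


def case_breakdown(left_keys, right_keys):
    """Count merged output rows per case, given each table's (person, visit) keys."""
    n_left, n_right = Counter(left_keys), Counter(right_keys)
    cases = Counter()
    for k in set(n_left) | set(n_right):
        l, r = n_left[k], n_right[k]  # a Counter returns 0 for absent keys
        if l and r:
            cases["matching seqnums"] += min(l, r)
            if l > r:
                cases["left has more seqnums"] += l - r
            if r > l:
                cases["right has more seqnums"] += r - l
        elif l:
            cases["left unmatched person/visit"] += l
        else:
            cases["right unmatched person/visit"] += r
    return cases


# Keys taken from the question's t1 and t2.
breakdown = case_breakdown(
    left_keys=[(1, 1), (1, 1), (1, 2)],
    right_keys=[(1, 1), (1, 1), (1, 1)],
)
print(dict(breakdown))
print(sum(breakdown.values()))  # total rows in the merged result
```

For the question's data only three of the five cases occur, and the counts add up to the four rows SAS produces.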

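So here is an example, using some temporary tables that make it easier to see what is going on. Note that the third row now has code1 and type1 filled in, even though those are 'extra'. I have only added three of the five cases (the three that occur in your initial example), but the other two are not too hard.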

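Note that this is an example of something that is far faster in SAS, because SAS has a row-wise concept: it is able to go one row at a time. SQL tends to take much longer on these problems with large tables, unless it is possible to partition things very neatly and have very good indexes, and even then I have never seen a SQL DBA do anywhere near as well as SAS on some of these types of problems. That is something you will have to accept, of course. SQL has its own advantages, one of which is probably price...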

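Here is my example code. I am sure it is not terribly elegant; hopefully one of the SQL folks can improve it. It is written to work in SQL Server (using table variables); the same thing should work in other variants with some changes (to use temporary tables), assuming they implement windowing. (SAS of course cannot do this particular thing, as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, modified with the additional bits at the end. Anyone who wants to improve this should feel free to edit it and/or copy it into a new or existing answer.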

declare @t1 table (person INT, visit INT, code1 INT, type1 INT);
declare @t2 table (person INT, visit INT, code2 INT, type2 INT);


insert into @t1 values (1,1,1,1)
insert into @t1 values (1,1,2,2)
insert into @t1 values (1,2,1,3)

insert into @t2 values (1,1,1,1)
insert into @t2 values (1,1,2,2)
insert into @t2 values (1,1,3,3)

-- Matching person/visit and matching seqnums: pair rows one-to-one by position.
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
                t1.code1, t1.type1, t2.code2, t2.type2
from (select *,
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1
     ) t1 inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        t1.seqnum = t2.seqnum
 union all
-- Matching person/visit where t2 has more rows: pair the last t1 row with each extra t2 row.
select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
                t1.code1, t1.type1, t2.code2, t2.type2
from (
      (select person, visit, MAX(seqnum) as max_rownum from (
        select person, visit, 
             row_number() over (partition by person, visit order by type1) as seqnum
      from @t1) t1_f 
      group by person, visit
     ) t1_m inner join
     (select *, row_number() over (partition by person, visit order by type1) as seqnum
       from @t1
      ) t1 
        on t1.person=t1_m.person and t1.visit=t1_m.visit
        and t1.seqnum=t1_m.max_rownum
        inner join
     (select *,
             row_number() over (partition by person, visit order by type2) as seqnum
      from @t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        t1.seqnum < t2.seqnum 
     )
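 union all
-- person/visit present only in t1: keep the t1 row, with NULLs for the t2 columns.
-- Testing the join key (rather than code2) avoids catching matched rows whose code2 is NULL.
 select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
     from @t1 t1 left join @t2 t2
    on t2.person=t1.person and t2.visit=t1.visit
    where t2.person is null;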

Answer 1: (score: 1)

You can replicate a SAS merge by adding a row_number() to each table:

select t1.*, t2.*
from (select t1.*,
             row_number() over (partition by person, visit order by ??) as seqnum
      from t1
     ) t1 full outer join
     (select t2.*,
             row_number() over (partition by person, visit order by ??) as seqnum
      from t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        t1.seqnum = t2.seqnum;

Notes:

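  • The ?? stands for the column(s) to use for ordering. SAS datasets have an intrinsic order. SQL tables do not, so the ordering needs to be specified.
  • You should list the columns explicitly (instead of using t1.*, t2.* in the outer query). I think SAS includes person and visit only once in the resulting dataset.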

EDIT:

Note: the query above produces separate values for the key columns. That is easy enough to fix:

select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
             row_number() over (partition by person, visit order by ??) as seqnum
      from t1
     ) t1 full outer join
     (select t2.*,
             row_number() over (partition by person, visit order by ??) as seqnum
      from t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        t1.seqnum = t2.seqnum;

That fixes the columns issue. You can fix the copying issue by using first_value() / last_value() or by using a more complicated join condition:

select coalesce(t1.person, t2.person) as person,
       coalesce(t1.visit, t2.visit) as visit,
       t1.code1, t1.type1, t2.code2, t2.type2
from (select t1.*,
             count(*) over (partition by person, visit) as cnt,
             row_number() over (partition by person, visit order by ??) as seqnum
      from t1
     ) t1 full outer join
     (select t2.*,
             count(*) over (partition by person, visit) as cnt,
             row_number() over (partition by person, visit order by ??) as seqnum
      from t2
     ) t2
     on t1.person = t2.person and t1.visit = t2.visit and
        (t1.seqnum = t2.seqnum or
         (t1.cnt > t2.cnt and t1.seqnum > t2.seqnum and t2.seqnum = t2.cnt) or
         (t2.cnt > t1.cnt and t2.seqnum > t1.seqnum and t1.seqnum = t1.cnt)
        );

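This implements the "keep the last row" logic in a single join. For performance reasons, you might want to put it into a separate left join on top of the original logic.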
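As a sanity check on that join condition, here is a rough Python sketch that evaluates the same predicate with a naive nested-loop full outer join. It uses the sample data from the other answer, where the codes differ so the carried-forward row is visible, and it assumes input order stands in for the ?? ordering. The helper names number_rows, matches and full_outer_join are made up for this example.

```python
from collections import Counter


def number_rows(rows):
    """Attach seqnum (row_number) and cnt (count(*)) per (person, visit)."""
    totals = Counter((r["person"], r["visit"]) for r in rows)
    seen = Counter()
    numbered = []
    for r in rows:
        k = (r["person"], r["visit"])
        seen[k] += 1
        numbered.append({**r, "seqnum": seen[k], "cnt": totals[k]})
    return numbered


def matches(a, b):
    """The extended ON condition: equal seqnum, or the shorter side's last row
    paired with the longer side's extra rows."""
    if (a["person"], a["visit"]) != (b["person"], b["visit"]):
        return False
    return (
        a["seqnum"] == b["seqnum"]
        or (a["cnt"] > b["cnt"] and a["seqnum"] > b["seqnum"] and b["seqnum"] == b["cnt"])
        or (b["cnt"] > a["cnt"] and b["seqnum"] > a["seqnum"] and a["seqnum"] == a["cnt"])
    )


def full_outer_join(left, right):
    left, right = number_rows(left), number_rows(right)
    result = [(a, b) for a in left for b in right if matches(a, b)]
    result += [(a, None) for a in left if not any(matches(a, b) for b in right)]
    result += [(None, b) for b in right if not any(matches(a, b) for a in left)]
    return result


t1 = [
    {"person": 1, "visit": 1, "code1": 1, "type1": 1},
    {"person": 1, "visit": 1, "code1": 2, "type1": 2},
    {"person": 1, "visit": 2, "code1": 1, "type1": 3},
]
t2 = [
    {"person": 1, "visit": 1, "code2": 1, "type2": 1},
    {"person": 1, "visit": 1, "code2": 2, "type2": 2},
    {"person": 1, "visit": 1, "code2": 3, "type2": 3},
]

joined = full_outer_join(t1, t2)
code_pairs = [
    (a["code1"] if a else None, b["code2"] if b else None) for a, b in joined
]
print(code_pairs)
```

With this data the third t2 row is paired with the last t1 row for visit 1 (code1 = 2) instead of NULLs, and the visit 2 row keeps NULLs on the t2 side, which is the behaviour the SAS merge shows.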