Question

I am trying to merge two large (million+) datasets in SAS. I'm pretty new to SAS and this is my first stackexchange question so hopefully the following makes sense...

SETUP:

All observations in the "Master" dataset have a unique identifier var1 and some have unique identifier var2. Some observations in the "Addition" dataset have unique identifier var1 and some have unique identifier var2; some observations have var2 but not var1.

I want to merge in all matches from the Addition dataset on EITHER var1 or var2 into the Master dataset.
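To make the setup concrete, here is a toy version of the two datasets. The values are made up purely for illustration, and the column names id1, id2, varmast and varadd are assumed to correspond to var1, var2 and the payload variables described above.

```sas
/* Hypothetical toy data: every master row has id1, some lack id2 */
data master;
    input id1 id2 varmast $;
    datalines;
1 101 a
2 . b
3 103 c
;
run;

/* Hypothetical toy data: addition rows may lack id1, id2, or neither */
data addition;
    input id1 id2 varadd $;
    datalines;
1 101 x
. 103 y
4 . z
;
run;
```

With data shaped like this, the goal would be for master row 1 to pick up varadd through id1, for row 3 to pick it up through id2, and for row 2 to stay unmatched.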

METHODS I HAVE EXPLORED:

Option A: proc sql left join on var1 OR var2. Unfortunately, because var2 is missing for many observations in both Master and Addition, this runs into a Cartesian product problem: it works, but it is impractically slow on my large datasets.

proc sql;
create table match as
select a.id1, a.id2, varmast, b.varadd
from master a
left join addition b
on (a.id1=b.id1 and a.id2=b.id2) or (a.id2=b.id2 and b.id2 is not null);
quit;

Option B: I'm thinking maybe merge on the first identifier and then use proc sql update to update from the Addition variables using the second identifier? I'm not sure of the syntax.

Option C: I could probably do this with a few regular merges, then append and dedupe, but that would take 5+ steps and each step takes a few minutes to run (on a good day), so I am hoping for something shorter.

Answer 1

I suspect that two left joins are what you want, one per identifier, and they should perform better than a single join with OR. The result is something like this:

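proc sql;
create table match as
    /* take both ids from master so unmatched rows keep them */
    select m.id1, m.id2, m.varmast, coalesce(a.varadd, a2.varadd) as varadd
    from master m left join
         addition a
         on m.id1 = a.id1 left join
         addition a2
         /* fall back to id2 only when id1 found no match; SAS treats
            missing = missing as true, so exclude missing id2 explicitly */
         on m.id2 = a2.id2 and m.id2 is not null and a.id1 is null;
quit;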
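For comparison, Option B from the question (join on the first identifier, then fill the gaps from the second) might be sketched as below. This is only a sketch under assumptions: it reuses the table and column names from the question's code, and it assumes id2 is unique within addition so that the subquery returns at most one value per row. Correlated updates like this can be slow on large tables in proc sql, so it may not beat the two-join query.

```sas
proc sql;
    /* Step 1: match on the first identifier only */
    create table match as
    select m.id1, m.id2, m.varmast, a.varadd
    from master m
    left join addition a
        on m.id1 = a.id1;

    /* Step 2: for rows still without varadd, look it up by id2.
       Assumes id2 is unique in addition (single-value subquery). */
    update match as t
    set varadd = (select a.varadd
                  from addition a
                  where a.id2 = t.id2)
    where t.varadd is null
      and t.id2 is not null
      and exists (select *
                  from addition a
                  where a.id2 = t.id2);
quit;
```

Note that the `t.varadd is null` filter also covers rows that matched on id1 but whose varadd was itself missing in addition, so those rows may be filled from an id2 match as well.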

How to merge two SAS datasets on one of two possible variables?
