I am trying to merge two large (million+) datasets in SAS. I'm pretty new to SAS and this is my first stackexchange question so hopefully the following makes sense...
SETUP:
All observations in the "Master" dataset have a unique identifier var1, and some also have a unique identifier var2 (these are id1 and id2 in the code below). In the "Addition" dataset, some observations have var1 and some have var2; some have var2 but not var1.
I want to merge in all matches from the Addition dataset on EITHER var1 or var2 into the Master dataset.
METHODS I HAVE EXPLORED:
Option A: proc sql left join on var1 OR var2. Unfortunately, because many observations are missing var2 in both Master and Addition, this runs into a Cartesian product problem: it works, but is impractically slow with my large datasets.
proc sql;
create table match as
select a.id1, a.id2, varmast, b.varadd
from master a
left join addition b
on (a.id1=b.id1 and a.id2=b.id2) or (a.id2=b.id2 and b.id2 is not null);
quit;
Option B: I'm thinking maybe merge on the first identifier and then use proc sql update to update from the Addition variables using the second identifier? I'm not sure of the syntax.
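For reference, Option B could be sketched roughly as follows. This is an untested sketch, not a confirmed solution: it reuses the table and column names from the Option A query, assumes id2 is unique within addition (so the subquery returns at most one row), and treats a missing varadd after the first join as "no match on id1".

```sas
proc sql;
    /* Step 1: left join on the first identifier only */
    create table match as
    select m.id1, m.id2, m.varmast, a.varadd
    from master m
    left join addition a
        on m.id1 = a.id1;

    /* Step 2: for rows still unmatched, pull varadd via the second identifier.
       Assumption: id2 is unique in addition, so the subquery yields one value. */
    update match as t
    set varadd = (select a.varadd
                  from addition a
                  where a.id2 = t.id2)
    where t.varadd is null
      and t.id2 is not null
      and exists (select 1
                  from addition a
                  where a.id2 = t.id2);
quit;
```

The `t.id2 is not null` guard keeps rows with a missing id2 from matching Addition rows that are also missing id2.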
Option C: I could probably do this with a few regular merges, then append and dedupe, but that would take 5+ steps, and each step takes a few minutes to run (on a good day), so I am hoping for something shorter.
Answer 0 (score: 2)
I suspect that two left joins are what you want, and they should perform better. The result is something like this:
proc sql;
create table match as
select m.id1, m.id2, varmast, coalesce(a.varadd, a2.varadd) as varadd
from master m left join
addition a
on m.id1 = a.id1 and m.id2 = a.id2 left join
addition a2
on m.id1 = a2.id1 and m.id2 is null and a.id1 is null;
quit;
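If the goal is a match on either identifier, one possible variant keys each join on a single identifier. This is only a sketch under assumptions the answer does not state: id1 and id2 are each unique within addition, and id1 is never missing in master (as the question describes). Because PROC SQL treats missing values as equal to each other in comparisons, the second join is explicitly restricted to rows where m.id2 is not missing.

```sas
proc sql;
    create table match as
    select m.id1, m.id2, m.varmast,
           /* prefer the id1 match; fall back to the id2 match */
           coalesce(a.varadd, a2.varadd) as varadd
    from master m
    left join addition a
        on m.id1 = a.id1
    left join addition a2
        on m.id2 = a2.id2
       and m.id2 is not null   /* avoid missing-to-missing matches */
       and a.id1 is null;      /* only used when the id1 join found nothing */
quit;
```

Each join here compares a single key with equality, so neither should need the OR condition that caused the Cartesian blow-up in Option A.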