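So what I have is one dataset with each city and state together in one column. Another dataset also has city and state in one column, but some cities are combined. For example:
The first dataset would have: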
CITY STATE POPULATION
Cape Coral Fl 1000000
Fort Myers FL 2000000
Gainesville FL 100000
The second dataset would have:
CITY STATE EMPLOYMENT
Cape Coral - Fort Myers FL 900
Gainesville FL 1000
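I have considered doing a "fuzzy" match, but for the hyphenated cities I would not get the full population. I could try splitting the hyphenated cities apart and halving the employment figure, but I don't know how to do that.
I'm hoping there is a simple solution I haven't thought of. I went ahead with a traditional merge on CITY STATE, but it only matched half of my dataset.
Thanks!
Answer 0 (score: 0)
If you make some assumptions, such as that the entries in the second dataset can be split on a dash (-) and that the state is always the last word, the second dataset can be split into more rows.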
data two;
  length city_state $100;
  /* & lets the value contain single blanks; two or more blanks end it */
  input city_state & employment;
  datalines;
Cape Coral - Fort Myers FL  900
Gainesville FL  1000
;
run;
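data two_b;
  length city_state_item $100;
  set two;
  /* the state is assumed to be the last blank-delimited word */
  state = scan(city_state, -1, ' ');
  /* search backwards from the end of the value for the state */
  p = find(city_state, trim(state), -101);
  city_state_base = substr(city_state, 1, p-1);
  /* write one row per dash-separated city */
  do _n_ = 1 by 1 while (scan(city_state_base, _n_, '-') ne '');
    city_state_item = catx(' ', scan(city_state_base, _n_, '-'), state);
    output;
    employment = 0; /* keep the employment figure on the first row only */
  end;
  drop p city_state_base state;
run;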
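After splitting, you will have to match ONE.city_state against TWO_B.city_state_item. How you split (or do not split) the employment figure depends on how you re-aggregate the matched data, or on how you use it to compute some employment-to-population ratio.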
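A minimal sketch of that matching step, not part of the original answer. It assumes the first dataset is read into a dataset named one with a city_state column, that case differences such as Fl versus FL should be ignored, and that summing per combined area is the re-aggregation you want. The table names matched and area_totals are made up for illustration.

```sas
/* Assumed layout of the first dataset (name ONE taken from the text above) */
data one;
  length city_state $100;
  input city_state & population;
  datalines;
Cape Coral Fl  1000000
Fort Myers FL  2000000
Gainesville FL  100000
;
run;

proc sql;
  /* Match every split item to ONE, ignoring letter case (assumption) */
  create table matched as
  select b.city_state,
         b.city_state_item,
         a.population,
         b.employment
  from two_b as b
  left join one as a
    on upcase(a.city_state) = upcase(b.city_state_item);

  /* Roll the matched rows back up to the combined area */
  create table area_totals as
  select city_state,
         sum(population) as population,
         sum(employment) as employment
  from matched
  group by city_state;
quit;
```

Because two_b keeps the employment figure only on the first row of each split group, summing it per combined area gives back the original value, while the populations of the member cities add up.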
Answer 1 (score: 0)
Under a few assumptions, this solution should work:
data a;
  length city_state $100;
  /* & lets the value contain single blanks; two or more blanks end it */
  input city_state & population;
  datalines;
Cape Coral Fl  1000000
Fort Myers FL  2000000
Gainesville FL  100000
;
run;

data b;
  length city_state $100;
  input city_state & employment;
  datalines;
Cape Coral - Fort Myers FL  900
Gainesville FL  1000
;
run;
proc sql;
  select a.city_state,
         b.city_state,
         a.population,
         /* halve the employment for combined (dashed) entries */
         case when b.city_state contains '-'
              then b.employment / 2
              else b.employment
         end as employment
  from a
  inner join b
    /* join on the city name with the trailing state removed */
    on b.city_state contains substr(a.city_state, 1, length(a.city_state) - length(scan(a.city_state, -1, ' ')));
quit;
Result:
city_state | city_state |POPULATION |EMPLOYMENT
------------------------------------------------------------------------
Cape Coral Fl | Cape Coral - Fort Myers FL | 1000000 | 450
Fort Myers FL | Cape Coral - Fort Myers FL | 2000000 | 450
Gainesville FL | Gainesville FL | 100000 | 1000
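Assuming every city_state that contains a - holds exactly two cities, the employment can be halved:
case when b.city_state contains '-' then b.EMPLOYMENT / 2 else b.EMPLOYMENT end as EMPLOYMENT
Assuming every city_state ends with the short (abbreviated) state, you can strip the state and join with a contains condition:
on b.city_state contains substr(a.city_state,1,length(a.city_state)-length(scan(a.city_state,-1,' ')));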
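If a combined entry could list more than two cities, the fixed division by 2 no longer holds. The variation below is only a sketch, not part of the original answer: it assumes that the number of cities equals the number of dashes plus one (so no city name contains a dash itself) and that an even split of employment is acceptable. It reuses the datasets a and b defined above; the column alias combined_city_state is made up for illustration.

```sas
proc sql;
  select a.city_state,
         b.city_state as combined_city_state,
         a.population,
         /* number of cities = number of dashes + 1 (assumption) */
         b.employment / (countc(b.city_state, '-') + 1) as employment
  from a
  inner join b
    on b.city_state contains substr(a.city_state, 1,
         length(a.city_state) - length(scan(a.city_state, -1, ' ')));
quit;
```

For entries without a dash, COUNTC returns 0 and the employment figure is divided by 1, so those rows are unchanged.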