Need advice on how to merge datasets with similar but not identical join IDs

Time: 2019-02-16 02:53:06

Tags: sas

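I have one dataset with every city and state together in a single column. A second dataset also has city and state in a single column, but some of the cities are combined. For example:

The first dataset would have: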

CITY STATE          POPULATION
Cape Coral Fl       1000000    
Fort Myers FL       2000000    
Gainesville FL      100000

The second dataset would have:

CITY STATE                    EMPLOYMENT    
Cape Coral - Fort Myers FL    900    
Gainesville FL                1000

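I considered doing a "fuzzy" match, but for the hyphenated cities I would not get the full population. I could try splitting the hyphenated cities apart and then halving the employment figure, but I don't know how to do that.

I'm hoping there is a simple solution that I haven't thought of. I went ahead with a traditional merge on CITY STATE, but it only matched half of my dataset.

Thanks!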

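2 Answers:

Answer 0: (score: 0)

If you make some assumptions, such as that entries in the second dataset can be split on a dash (-) and that the state is always the last word, the second dataset can be expanded into more rows.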

data two;
  length city_state $100;
  /* the & modifier lets the value contain single embedded blanks */
  input CITY_STATE & EMPLOYMENT;
datalines;
Cape Coral - Fort Myers FL    900
Gainesville FL                1000
;
run;

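data two_b;
  length city_state_item $100;
  set two;
  /* the state is assumed to be the last blank-delimited word */
  state = scan (city_state, -1, ' ');
  /* negative start position: search right to left for the state */
  p = find (city_state, trim(state), -101);
  city_state_base = substr(city_state,1,p-1);
  /* output one row per dash-separated city */
  do _n_ = 1 by 1 while (scan(city_state_base,_n_,'-') ne '');
    city_state_item = catx (' ', scan(city_state_base,_n_,'-'), state);
    OUTPUT;
    employment = 0;  /* only the first city keeps the employment value */
  end;
  drop p city_state_base state;
run;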

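After the split, you will have to match ONE.city_state to TWO_B.city_state_item and decide whether and how to divide employment among the split cities, depending on how you re-aggregate the matched data or use it to compute something like an employment-to-population ratio.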
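A minimal sketch of that matching step, under stated assumptions. It reuses dataset TWO from above. The dataset ONE, its variable names, and the helper names TWO_C, n_cities, employment_share and MATCHED are hypothetical, chosen only for illustration. It also assumes that an even split of employment across the combined cities is acceptable, which may not reflect reality. The join compares upper-cased keys because the sample data mixes Fl and FL.

```sas
/* Hypothetical first dataset, typed in from the question's example */
data one;
  length city_state $100;
  input city_state & population;
datalines;
Cape Coral Fl       1000000
Fort Myers FL       2000000
Gainesville FL      100000
;
run;

/* Variant of the split step: each city gets an equal share of employment */
data two_c;
  length city_state_item $100;
  set two;
  state = scan(city_state, -1, ' ');
  p = find(city_state, trim(state), -101);
  city_state_base = substr(city_state, 1, p-1);
  /* number of dash-separated cities in this row (assumption: no city name contains a dash) */
  n_cities = countw(city_state_base, '-');
  do i = 1 to n_cities;
    city_state_item = catx(' ', scan(city_state_base, i, '-'), state);
    employment_share = employment / n_cities;
    output;
  end;
  keep city_state_item employment_share;
run;

/* Case-insensitive match of the split rows back to ONE */
proc sql;
  create table matched as
  select one.city_state, one.population, two_c.employment_share
  from one
  inner join two_c
    on upcase(one.city_state) = upcase(two_c.city_state_item);
quit;
```

With the sample rows, Cape Coral and Fort Myers should each receive half of the combined 900, while Gainesville keeps its full value. Whether an even split is appropriate depends on how the figures are used afterwards; weighting by population is another option.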

Answer 1: (score: 0)

Under a few assumptions, I think this solution could work:

data a;
  length city_state $100;
  input CITY_STATE & POPULATION;
  datalines;
  Cape Coral Fl       1000000
  Fort Myers FL       2000000
  Gainesville FL      100000
;
run;

data b;
  length city_state $100;
  input CITY_STATE & EMPLOYMENT;
  datalines;
  Cape Coral - Fort Myers FL    900
  Gainesville FL                1000
;
run;

proc sql;
  select a.city_state, b.city_state, a.population,
         case when b.city_state contains '-' then b.EMPLOYMENT / 2
              else b.EMPLOYMENT
         end as EMPLOYMENT
  from a
  inner join b
    on b.city_state contains
       substr(a.city_state, 1, length(a.city_state) - length(scan(a.city_state, -1, ' ')));
quit;

Result:

city_state     | city_state                 |POPULATION |EMPLOYMENT 
------------------------------------------------------------------------
Cape Coral Fl  | Cape Coral - Fort Myers FL | 1000000   |  450 
Fort Myers FL  | Cape Coral - Fort Myers FL | 2000000   |  450 
Gainesville FL | Gainesville FL             | 100000    | 1000 

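Assuming every city_state that contains a - holds exactly two cities, employment can be halved:

case when b.city_state contains '-' then b.EMPLOYMENT /2 else b.EMPLOYMENT End as EMPLOYMENT

Assuming every city_state ends with the state abbreviation, you can strip the state and use a contains test:

b.city_state contains substr(a.city_state,1,length(a.city_state)-length(scan(a.city_state,-1,' ')));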
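If a combined entry could hold more than two cities, the fixed division by 2 could be replaced by a count of the dashes. The following is only a sketch under that assumption. It reuses datasets a and b from above, still assumes that no city name itself contains a dash, and the alias b_city_state is introduced here purely for illustration.

```sas
proc sql;
  select a.city_state,
         b.city_state as b_city_state,
         a.population,
         /* countc counts the '-' characters; cities = dashes + 1 */
         b.EMPLOYMENT / (countc(b.city_state, '-') + 1) as EMPLOYMENT
  from a
  inner join b
    on b.city_state contains
       substr(a.city_state, 1, length(a.city_state) - length(scan(a.city_state, -1, ' ')));
quit;
```

For rows without a dash the divisor is 1, so the CASE expression is no longer needed.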