Question

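I have a SAS table with a lot of missing values. This is just a simple example; the real table is much bigger (> 1000 rows) and the numbers are different. What stays the same is that column a has no missing values, while columns b and c contain sequences that are shorter than the length of a.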

What I want is to fill b and c by repeating their sequences until the columns are full. The result should look like this:

    a   b   c
    1   1b  1000
    2   2b  2000
    3   3b  1000
    4   1b  2000
    5   2b  1000
    6   3b  2000
    7   1b  1000

I tried to write a macro, but it got messy.

Answer 1

A hash-of-hashes solution is the most flexible here, I suspect.

data have;
infile datalines delimiter="|";
input a b $ c;
datalines;
1|1b|1000
2|2b|2000
3|3b|    
4|  |    
5|  |    
6|  |    
7|  |    
;;;;
run;


%let vars=b c;

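data want;
  set have;
  rownum = _n_;
  /* First iteration only: build one hash per variable, all stored in a hash of hashes */
  if _n_=1 then do;
    declare hash hoh(ordered:'a');
    declare hiter hih('hoh');
    hoh.defineKey('varname');
    hoh.defineData('varname','hh');
    hoh.defineDone();

    declare hash hh();

    do varnum = 1 to countw("&vars.");
        varname = scan("&vars",varnum);
        /* Per-variable hash: key is the row number, data is that variable's value */
        hh = _new_ hash(ordered:'a');
        hh.defineKey("rownum");
        hh.defineData(varname);
        hh.defineDone();
        hoh.replace();
    end;
  end;

  /* Visit each variable's hash in turn */
  do rc=hih.next() by 0 while (rc=0);
    if strip(vvaluex(varname)) in (" ",".")  then do;
        /* Value is missing: look up the stored value at the cycled row position */
        num_items = hh.num_items;
        rowmod = mod(_n_-1,num_items)+1;
        hh.find(key:rowmod);
    end;
    else do;
      /* Value is present: store it under the current row number */
      hh.replace();
    end;
    rc = hih.next();
  end;
  keep a &Vars.;
run;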

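Basically, build a hash for each variable you are working with. Each of them is added to the hash of hashes. Then we iterate over that and check whether the variable in question is populated on the current row. If it is, we add its value to that variable's hash. If it is not, we retrieve the appropriate stored value instead.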
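
The lookup for a missing row depends on the index expression `mod(_n_-1, num_items)+1`, which maps any row number back into the range of stored rows. A minimal sketch of just that arithmetic, assuming a stored sequence of length 3 (the names `k`, `n`, and `idx` are illustrative and not part of the answer's code):

```sas
/* Sketch only: shows how the cycled index behaves for 7 rows and 3 stored values */
data _null_;
  k = 3;                        /* assumed number of non-missing values */
  do n = 1 to 7;                /* n plays the role of _n_ */
    idx = mod(n - 1, k) + 1;    /* cycles 1,2,3,1,2,3,1 */
    put n= idx=;                /* write the pair to the log */
  end;
run;
```

Subtracting 1 before taking the modulus and adding 1 afterwards keeps the result in 1..k rather than 0..k-1, which is what a row-number key needs.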

Answer 2

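Assuming that you can tell how many rows to use for each variable by counting how many non-missing values are in its column, you could use this code-generation technique to generate a data step that uses SET statements with the POINT= option to cycle through the first Nx observations for variable X.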
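
To make the POINT= idea concrete before any code generation, here is a hand-written sketch for a single variable. It assumes the sample table `have` from Answer 1 and hard-codes the count of non-missing `b` values as 3; the names `b_cycled` and `_p` are hypothetical.

```sas
/* Sketch only: the count 3 is hard-coded for the sample data (an assumption) */
data b_cycled;
  set have(keep=a);             /* sequential read; the step ends when HAVE is exhausted */
  _p = mod(_n_ - 1, 3) + 1;     /* observation number to fetch: 1,2,3,1,2,3,1 */
  set have(keep=b) point=_p;    /* direct-access read of B from observation _p */
run;
```

The first SET statement drives the loop, so no STOP statement is needed, and the POINT= variable is not written to the output data set.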

First get the list of variable names.

proc transpose data=have(obs=0) out=names ;
  var _all_;
run;

Then use them to generate a PROC SQL select statement that counts the number of non-missing values for each variable.

filename  code temp ;
data _null_;
  set names end=eof ;
  file code ;
  if _n_=1 then put 'create table counts as select ' ;
  else put ',' @;
  put 'sum(not missing(' _name_ ')) as ' _name_ ;
  if eof then put 'from have;' ;
run;

proc sql noprint;
%include code /source2 ;
quit;

Then transpose the result so that you again have one row per variable name, but this time with the count in COL1 as well.

proc transpose data=counts out=names ;
  var _all_;
run;

Now use that to generate the SET statements the DATA step needs to create the output from the input.

filename code temp;
data _null_;
  set names ;
  file code ;
  length pvar $32 ;
  pvar = cats('_point',_n_);
  put pvar '=mod(_n_-1,' col1 ')+1;' ;
  put 'set have(keep=' _name_ ') point=' pvar ';' ;
run;

Now use the generated statements.

data want ;
  set have(drop=_all_);
  %include code / source2;
run;

So for your example data set, with variables A, B, and C and 7 observations in total, the LOG for the generated data step looks like this:

1229  data want ;
1230    set have(drop=_all_);
1231    %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file .../#LN00026.
1232 +_point1 =mod(_n_-1,7 )+1;
1233 +set have(keep=a ) point=_point1 ;
1234 +_point2 =mod(_n_-1,3 )+1;
1235 +set have(keep=b ) point=_point2 ;
1236 +_point3 =mod(_n_-1,2 )+1;
1237 +set have(keep=c ) point=_point3 ;
NOTE: %INCLUDE (level 1) ending.
1238  run;

NOTE: There were 7 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 7 observations and 3 variables.

Answer 3

Populate temporary arrays with the non-missing values, then go through the rows and fill in the appropriate value.

Set up the data

data have;
infile datalines delimiter="|";
input a b $ c;
datalines;
1|1b|1000
2|2b|2000
3|3b|    
4|  |    
5|  |    
6|  |    
7|  |    
;

Get the counts of non-missing values

proc sql noprint;
select count(*)
    into :n_b
    from have
    where b ^= "";

select count(*)
    into :n_c
    from have
    where c ^=.;
quit;

Now fill in the missing values by repeating the contents of each array.

data want;
set have;
/*Temporary Arrays*/
array bvals[&n_b] $ 32  _temporary_;
array cvals[&n_c]  _temporary_;

if _n_ <= &n_b then do;
    /*Populate the b array*/
    bvals[_n_] = b;
end;
else do;
    /*Fill the missing values*/
    b = bvals[mod(_n_+&n_b-1,&n_b)+1];
end;

if _n_ <= &n_c then do;
    /*populate C values array*/
    cvals[_n_] = c;
end;
else do;
    /*fill in the missing C values*/
    c = cvals[mod(_n_+&n_c-1,&n_c)+1];
end;
run;
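
The two counting queries in this answer could probably be collapsed into a single one. The sketch below assumes the same `have` table and relies on `count(column)` counting only non-missing values; the `trimmed` keyword (available in recent SAS releases) strips the leading blanks from the stored macro values.

```sas
/* Sketch only: one pass over HAVE instead of two separate queries */
proc sql noprint;
  select count(b), count(c)
    into :n_b trimmed, :n_c trimmed
    from have;
quit;

/* Write the counts to the log as a quick check */
%put NOTE: n_b=&n_b n_c=&n_c;
```

For the sample data this should report 3 and 2, matching the counts that appear in the log shown in Answer 2.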

Answer 4

data want;
   set have;
   /* Hard-coded for the sample data: b cycles every 3 rows, c every 2 rows */
   n=mod(_n_,3);
   if n=0 then b='3b';
   else b=cats(n,'b');
   if mod(_n_,2)=1 then c=1000;
   else c=2000;
   drop n;
run;

Fill a SAS variable by repeating values