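SAS SCAN function and missing values

Time: 2017-11-04 01:00:29

Tags: sas missing-data

I am trying to develop a recursive program that imputes missing string values using flat (uniform) probabilities. For example, if a variable has three possible values and one observation is missing, the missing observation has a 33% chance of being replaced by each of those values.

Note: the purpose of this post is not to discuss the merits of imputation techniques.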

DATA have;
    INPUT id gender $ b $ c $ x; 
    CARDS; 
    1 M Y . 5 
    2 F N . 4 
    3   N Tall 4 
    4 M   Short 2 
    5 F Y Tall 1
    ;

/* Counts number of categories i.e. 2 */
proc sql; 
    SELECT COUNT(Unique(gender)) into :rescats 
    FROM have 
    WHERE Gender ~= " " ;
    Quit;

%let rescats = &rescats; 
%put &rescats; /*internal check */

/* Collects response categories separated by commas i.e. F,M */
proc sql; 
    SELECT UNIQUE gender into :genders separated by ","
    FROM have
    WHERE Gender ~= " "
    GROUP BY Gender;
    QUIT;

%let genders = &genders;
%put &genders;  /*internal check */

/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql; 
    SELECT COUNT (UNIQUE(customerKey)) into :ID
    FROM have
    WHERE customerkey < 6;
QUIT;

%let ID = &ID;
%put &ID; /*internal check */

data want; 
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345; 
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);  
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1; 
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ','); 
END;
RUN;

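My SCAN function does not create F or M observations in gender. It also seems to create new variables named M and F. In addition, the DO loop creates an additional entry in CustomerKey. Is there a way to get rid of these?

I would prefer to solve this with loops and macros. I am not yet proficient with arrays.

2 Answers:

Answer 0: (score: 1)

Here is my attempt at tidying this up: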

/*Changed to delimited input so that values end up in the right columns*/
/*INFILE is placed before INPUT so that dlm=',' is in effect when the first record is read*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS; 
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;

/*Consolidated into 1 proc, added noprint and removed unnecessary group by*/
proc sql noprint; 
    /* Counts number of categories i.e. 2 */
    SELECT COUNT(unique(gender)) into :rescats 
    FROM have 
    WHERE not(missing(Gender));
    /* Collects response categories separated by commas i.e. F,M */    
    SELECT unique gender into :genders separated by ","
    FROM have
    WHERE not(missing(Gender))
    ;   
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders;  /*internal check */

/*Removed ID list code as it wasn't making any difference to the imputation in this example*/

data want; 
SET have;
seed = 12345; 
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);  
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1; 
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ',');  /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
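
One caveat about the ROUND expression above: ROUND(u*(&rescats - 1)) + 1 is uniform only when there are exactly two categories. With three categories it picks the middle one about half the time and each end one about a quarter of the time. The following is a minimal sketch of a variant that uses CEIL instead, assuming the same sample data and macro variable names as above. RAND('UNIFORM') with CALL STREAMINIT is swapped in for RANUNI here; that substitution is a choice made for this sketch, not part of the original answer.

```sas
/* Sample data from the question; a period marks a missing value */
data have;
    input id gender $ b $ c $ x;
    datalines;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;

proc sql noprint;
    /* Number of distinct non-missing categories */
    select count(distinct gender) into :rescats
    from have
    where not missing(gender);
    /* Comma-separated list of the categories */
    select distinct gender into :genders separated by ','
    from have
    where not missing(gender);
quit;

data want;
    set have;
    /* Seed the RAND stream once, on the first observation */
    if _n_ = 1 then call streaminit(12345);
    /* u lies strictly between 0 and 1 */
    u = rand('uniform');
    /* CEIL maps u to 1..&rescats, each with probability 1/&rescats */
    RandGender = ceil(u * &rescats);
    /* Quotes make the list a string, not variable names */
    if missing(gender) then gender = scan("&genders", RandGender, ',');
run;
```

With only two categories this behaves like the ROUND version; the difference shows up only once a variable has three or more distinct values.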

Answer 1: (score: 1)

Halo8:

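/*Changed to delimited input so that values end up in the right columns*/
/*INFILE is placed before INPUT so that dlm=',' is in effect when the first record is read*/
DATA have;
infile cards dlm=',';
INPUT id gender $ b $ c $ x;
CARDS; 
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
  • Tip: During INPUT you can use a period (.) to represent a missing value for a character variable.
  • Tip: DATALINES is the modern replacement for CARDS.
  • Tip: Data values do not have to line up, but it helps human readers.

So this also works: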

/*List input; a period marks a missing value, including for character variables*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES; 
1 M Y .     5
2 F N .     4
3 . N Tall  4
4 M . Short 2
5 F Y Tall  1
;
run;
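  • Tip: Your technique requires two passes over the data.
    • One to determine the distinct values.
    • A second to apply your imputation.
    • Most approaches require two passes per variable processed. A hash approach can get by with only two passes but requires more memory.

There are many ways to determine the distinct values: sorting + FIRST., Proc FREQ, DATA Step HASH, SQL, and more.

Tip: Solutions that move data into code and back into data are sometimes needed, but they can be troublesome. Often the cleanest approach is to let data remain data.

For example: INTO would be the wrong approach if the concatenated distinct values need more than 64K.

Tip: Data-to-code is especially troublesome for continuous values and other values that are not exactly the same once they become code.

For example: high-precision numeric values, strings with control characters, strings with embedded quotes, and so on.

Here is one way using SQL. As mentioned, Proc SURVEYSELECT is much better for real applications.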
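
As an illustration of the first option in that list, here is a minimal sketch of the sorting + FIRST. approach, assuming the sample data from the question. The data set names gender_sorted and gender_levels are hypothetical, chosen only for this example.

```sas
/* Sample data from the question; a period marks a missing value */
data have;
    input id gender $ b $ c $ x;
    datalines;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;

/* BY-group processing requires sorted input */
proc sort data=have(keep=gender) out=gender_sorted;
    by gender;
run;

data gender_levels;
    set gender_sorted;
    by gender;
    /* Keep the first row of each BY group, skipping the missing group */
    if first.gender and not missing(gender);
    /* Number the surviving rows 1, 2, ... for a later random lookup */
    rownum + 1;
run;
```

The distinct values stay in a data set rather than a macro variable, which fits the advice about letting data remain data.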

Proc SQL;
  Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
  %let REPLACEMENT_COUNT = &SQLOBS;  %* Tip: Take advantage of automatic macro variable SQLOBS;

data REPLACEMENTS;
  set REPLACEMENTS;
  rownum+1; * rownum needed for RANUNI matching;
run;

Proc SQL;
  * Perform replacement of missing values;
  Update have
    set gender = 
      (
        select gender 
        from REPLACEMENTS
        where rownum =  ceil(&REPLACEMENT_COUNT * ranuni(1234))
      )
    where gender is NULL
  ;

%let SYSLAST = have;
DM 'viewtable have' viewtable;

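You do not need to worry about columns that have no missing values, because no replacement happens for them. For a column that does have missing values, the candidate list REPLACEMENTS excludes the missing value, so REPLACEMENT_COUNT is the correct count for computing the uniform replacement probability, 1 / COUNT, coded as rownum = ceil(COUNT * random).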