我正在尝试使用平坦概率开发一个递归程序来丢失字符串值(例如,如果变量有三个可能的值而一个观察值丢失,则缺失的观察将有33%被替换为任何值) 。
注意:本文的目的不是讨论插补技术的优点。
DATA have;
INPUT id gender $ b $ c $ x;
CARDS;
1 M Y . 5
2 F N . 4
3 N Tall 4
4 M Short 2
5 F Y Tall 1
;
/* Counts number of categories i.e. 2 */
proc sql;
SELECT COUNT(Unique(gender)) into :rescats
FROM have
WHERE Gender ~= " " ;
Quit;
%let rescats = &rescats;
%put &rescats; /*internal check */
/* Collects response categories separated by commas i.e. F,M */
proc sql;
SELECT UNIQUE gender into :genders separated by ","
FROM have
WHERE Gender ~= " "
GROUP BY Gender;
QUIT;
%let genders = &genders;
%put &genders; /*internal check */
/* Counts entries to be evaluated. In this case observations 1 - 5 */
/* Note CustomerKey is an ID variable */
proc sql;
SELECT COUNT (UNIQUE(customerKey)) into :ID
FROM have
WHERE customerkey < 6;
QUIT;
%let ID = &ID;
%put &ID; /*internal check */
data want;
SET have;
DO i = 1 to &ID; /* Control works from 1 to 5 */
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 and 2 */
RandGender = (ROUND(u*(&rescats - 1)) + 1)*1;
/* PROBLEM Should if gender is missing set string value of M or F */
IF gender = ' ' THEN gender = SCAN(&genders, RandGender, ',');
END;
RUN;
我SCAN功能不会在性别中创建F或M观察。它似乎也创建了一个新的M和F变量。此外,DO循环在CustomerKey中创建附加条目。有没有办法摆脱这些?
我更愿意使用循环和宏来解决这个问题。我还不熟练使用数组。
答案 0 :(得分:1)
这是我试图整理一下这个:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
/*Consolidated into 1 proc, addded noprint and removed unnecessary group by*/
proc sql noprint;
/* Counts number of categories i.e. 2 */
SELECT COUNT(unique(gender)) into :rescats
FROM have
WHERE not(missing(Gender));
/* Collects response categories separated by commas i.e. F,M */
SELECT unique gender into :genders separated by ","
FROM have
WHERE not(missing(Gender))
;
Quit;
/*Removed redundant %let statements*/
%put rescats = &rescats; /*internal check */
%put genders = &genders; /*internal check */
/*Removed ID list code as it wasn't making any difference to the imputation in this example*/
data want;
SET have;
seed = 12345;
/* Sets u to rand value between 0.00 and 1.00 */
u = RanUni(seed);
/* Sets rand gender to either 1 or 2 */
RandGender = ROUND(u*(&rescats - 1)) + 1;
IF missing(gender) THEN gender = SCAN("&genders", RandGender, ','); /*Added quotes around &genders to prevent SAS interpreting M and F as variable names*/
RUN;
答案 1 :(得分:1)
Halo8:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
infile cards dlm=',';
CARDS;
1,M,Y, ,5
2,F,N, ,4
3, ,N,Tall,4
4,M, ,Short,2
5,F,Y,Tall,1
;
run;
因此这也有效:
/*Changed to delimited input so that values end up in the right columns*/
DATA have;
INPUT id gender $ b $ c $ x;
DATALINES;
1 M Y . 5
2 F N . 4
3 . N Tall 4
4 M . Short 2
5 F Y Tall 1
;
run;
有很多方法可以判断不同的值:SORTING + FIRST。,Proc FREQ,DATA Step HASH,SQL等等。
提示:有时需要将数据移动到代码回复数据的解决方案,但可能很麻烦。通常最干净的方法是让数据保持数据。
例如:如果连接的不同值需要超过64K ,则INTO将是错误的方法
提示:数据到代码对于连续值和其他值在代码变得完全相同时尤其麻烦。
例如:高精度数值,带控制字符的字符串,带嵌入引号的字符串等......
这是一种使用SQL的方法。如前所述,Proc SURVEYSELECT对于实际应用来说要好得多。
Proc SQL;
Create table REPLACEMENTS as select distinct gender from have where gender is NOT NULL;
%let REPLACEMENT_COUNT = &SQLOBS; %* Tip: Take advantage of automatic macro variable SQLOBS;
data REPLACEMENTS;
set REPLACEMENTS;
rownum+1; * rownum needed for RANUNI matching;
run;
Proc SQL;
* Perform replacement of missing values;
Update have
set gender =
(
select gender
from REPLACEMENTS
where rownum = ceil(&REPLACEMENT_COUNT * ranuni(1234))
)
where gender is NULL
;
%let SYSLAST = have;
DM 'viewtable have' viewtable;
您不必担心没有缺失值的列,因为这些列不会发生替换。对于缺少列的候选列表,REPLACEMENTS排除缺失,REPLACEMENT_COUNT对于计算统一替换概率是正确的,1 / COUNT,编码为rownum = ceil(随机)