Filling in missing values of a rolling correlation matrix

Time: 2014-02-26 13:50:14

Tags: sas

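This question is partly related to this question.

My data file can be found here. I am using a sample period from 1 January 2008 to 31 December 2013. The data file has no missing values.

The following code generates a rolling correlation matrix for every day from 1 January 2008 to 31 December 2013, using a rolling window of the previous year's values. For example, the correlation between AUT and BEL on 1 January 2008 is computed from the series of values from 1 January 2007 to 1 January 2008, and likewise for all other pairs.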

data work.rolling;
set mm.rolling;
run;

%macro rollingCorrelations(inputDataset=, refDate=);
/*first get a list of unique dates on or after the reference date*/
proc freq data = &inputDataset. noprint;
where date >="&refDate."d;
table date/out = dates(keep = date);
run;


/*for each date calculate what the window range is, here using a year's length*/
data dateRanges(drop = date);
set dates end = endOfFile 
                nobs= numDates;
format toDate fromDate date9.;

toDate=date;
fromDate = intnx('year', toDate, -1, 's');

call symputx(compress("toDate"!!_n_), put(toDate,date9.));
call symputx(compress("fromDate"!!_n_), put(fromDate, date9.) );

/*find how many times(numberOfWindows) we need to iterate through*/
if endOfFile then do;
call symputx("numberOfWindows", numDates);
end;

run;
%do i = 1 %to &numberOfWindows.;
/*create a temporary view which has the filtered data that is passed to PROC CORR*/
data windowedDataview / view = windowedDataview;
set  &inputDataset.;
where date between "&&fromDate&i."d and "&&toDate&i."d;
drop date;
run;
    /*the output dataset from each PROC CORR run will be 
correlations_DDMMMYYYY<from date>_DDMMMYYYY<to date>*/
proc corr data = windowedDataview 
outp = correlations_&&fromDate&i.._&&toDate&i. (where=(_type_ = 'CORR'))

        noprint;
run;

%end;

/*append all datasets into a single table*/
data all_correlations;
format from to date9.;
set correlations_:
     indsname = datasetname
;
from = input(substr(datasetname,19,9),date9.);
to = input(substr(datasetname,29,9), date9.);
run;


%mend rollingCorrelations;
%rollingCorrelations(inputDataset=rolling, refDate=01JAN2008)
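
To see which window a given date maps to, the INTNX call used inside the macro can be run on its own. This is only a minimal sketch: the date literal is an example value, not something taken from the data file.

```sas
/* Compute the one-year look-back window for a single example date. */
data _null_;
  toDate   = '01JAN2008'd;
  /* 's' (same) alignment keeps the same day of the year, one year earlier */
  fromDate = intnx('year', toDate, -1, 's');
  put fromDate= date9. toDate= date9.;
run;
```

The log should show the window running from 01JAN2007 to 01JAN2008, which matches the example in the description above.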

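An excerpt of the output can be found here.

As can be seen, rows 2 to 53 show the correlation matrix for 1 April 2008. However, the correlation matrix for 1 April 2009 has a problem: the correlation coefficients for ALPHA and its pairs are missing. This is because, if you look at the data file, the values of ALPHA from 1 April 2008 to 1 April 2009 are all zero, so the series has zero variance over that window and the correlation involves a division by zero. The same situation occurs for some other data values and windows.

To solve this, I would like to know how to modify the code above so that, when this happens (i.e., all values are 0 between two particular dates), the correlation between the two series is simply computed using the whole sample period. For example, the correlation between HSBC and ALPHA is missing on 1 April 2009, so this correlation should be computed using the values from 1 January 2008 to 31 December 2013 instead of the values from 1 April 2008 to 1 April 2009.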
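
The effect can be reproduced on a toy dataset. The sketch below is not part of the original code; the dataset names work.demo and work.demo_corr and the variables x, y and z are hypothetical. Because z is constant, every correlation involving z should come back as missing.

```sas
/* Hypothetical toy data: z is constant (all zeros), x and y vary. */
data work.demo;
  input x y z;
  datalines;
1 2 0
2 4 0
3 5 0
4 9 0
;
run;

/* Same PROC CORR pattern as in the macro: keep only the correlation rows. */
proc corr data = work.demo
          outp = work.demo_corr (where = (_type_ = 'CORR'))
          noprint;
run;

/* Inspect the matrix; the z row and z column are expected to be missing. */
proc print data = work.demo_corr;
run;
```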

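1 answer:

Answer 0: (score: 1)

After running the macro above and obtaining the all_correlations dataset, you need to run another PROC CORR, this time using all of the data, e.g.,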

/*first filter the data to be between "01JAN2008"d and "31DEC2013"d*/
data work.all_data_01JAN2008_31DEC2013;
set mm.rolling;
where date between "01JAN2008"d and "31DEC2013"d;
drop date ;
run;

Then pass the above dataset to PROC CORR:

proc corr data =  work.all_data_01JAN2008_31DEC2013
outp = correlations_01JAN2008_31DEC2013
 (where=(_type_ = 'CORR'))

        noprint;
run;
data correlations_01JAN2008_31DEC2013;
length id 8;
set correlations_01JAN2008_31DEC2013;
/*add a column identifier to make sure the order of the correlation matrix is preserved when joined with other tables*/
id = _n_;
run;

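You will get a dataset in which the _name_ column is unique. You then have to join correlations_01JAN2008_31DEC2013 with all_correlations, so that wherever a value is missing in all_correlations, the corresponding value from correlations_01JAN2008_31DEC2013 is inserted in its place. For this we can use PROC SQL and the COALESCE function.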

PROC SQL;
CREATE TABLE MISSING_VALUES_IMPUTED AS 
SELECT
A.FROM
,A.TO
,b.id
,a._name_
,coalesce(a.AUT,b.AUT) as AUT
,coalesce(a.BEL,b.BEL) as BEL
,coalesce(a.DEN,b.DEN) as DEN
,coalesce(a.FRA,b.FRA) as FRA
,coalesce(a.GER,b.GER) as GER
,coalesce(a.GRE,b.GRE) as GRE
,coalesce(a.IRE,b.IRE) as IRE
,coalesce(a.ITA,b.ITA) as ITA
,coalesce(a.NOR,b.NOR) as NOR
,coalesce(a.POR,b.POR) as POR
,coalesce(a.SPA,b.SPA) as SPA
,coalesce(a.SWE,b.SWE) as SWE
,coalesce(a.NL,b.NL) as NL
,coalesce(a.ERS,b.ERS) as ERS
,coalesce(a.RZB,b.RZB) as RZB
,coalesce(a.DEX,b.DEX) as DEX
,coalesce(a.KBD,b.KBD) as KBD
,coalesce(a.DAB,b.DAB) as DAB
,coalesce(a.BNP,b.BNP) as BNP
,coalesce(a.CRDA,b.CRDA) as CRDA
,coalesce(a.KN,b.KN) as KN
,coalesce(a.SGE,b.SGE) as SGE
,coalesce(a.CBK,b.CBK) as CBK
,coalesce(a.DBK,b.DBK) as DBK
,coalesce(a.IKB,b.IKB) as IKB
,coalesce(a.ALPHA,b.ALPHA) as ALPHA
,coalesce(a.ALBK,b.ALBK) as ALBK
,coalesce(a.IPM,b.IPM) as IPM
,coalesce(a.BKIR,b.BKIR) as BKIR
,coalesce(a.BMPS,b.BMPS) as BMPS
,coalesce(a.PMI,b.PMI) as PMI
,coalesce(a.PLO,b.PLO) as PLO
,coalesce(a.BINS,b.BINS) as BINS
,coalesce(a.MB,b.MB) as MB
,coalesce(a.UC,b.UC) as UC
,coalesce(a.BCP,b.BCP) as BCP
,coalesce(a.BES,b.BES) as BES
,coalesce(a.BBV,b.BBV) as BBV
,coalesce(a.SCHSPS,b.SCHSPS) as SCHSPS
,coalesce(a.NDA,b.NDA) as NDA
,coalesce(a.SEA,b.SEA) as SEA
,coalesce(a.SVK,b.SVK) as SVK
,coalesce(a.SPAR,b.SPAR) as SPAR
,coalesce(a.CSGN,b.CSGN) as CSGN
,coalesce(a.UBSN,b.UBSN) as UBSN
,coalesce(a.ING,b.ING) as ING
,coalesce(a.SNS,b.SNS) as SNS
,coalesce(a.BARC,b.BARC) as BARC
,coalesce(a.HBOS,b.HBOS) as HBOS
,coalesce(a.HSBC,b.HSBC) as HSBC
,coalesce(a.LLOY,b.LLOY) as LLOY
,coalesce(a.STANBS,b.STANBS) as STANBS
from all_correlations as a
inner join correlations_01JAN2008_31DEC2013 as b
on a._name_ = b._name_
order by
A.FROM
,A.TO
,b.id
;
quit;
/*verify that no missing values are left. The NMISS column should be 0 for all variables*/
proc means data = MISSING_VALUES_IMPUTED n nmiss;
run;
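
The long COALESCE list above can also be generated from the dataset metadata instead of being typed by hand. The following is a sketch under assumptions, not part of the original answer: it assumes both datasets live in the WORK library, that every numeric column other than id in correlations_01JAN2008_31DEC2013 is a correlation column, and the macro variable name coalesceList is hypothetical.

```sas
/* Build one "coalesce(a.VAR,b.VAR) as VAR" term per numeric column. */
proc sql noprint;
  select cats('coalesce(a.', name, ',b.', name, ')') || ' as ' || strip(name)
    into :coalesceList separated by ', '
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'CORRELATIONS_01JAN2008_31DEC2013'
    and type = 'num'
    and upcase(name) ne 'ID';   /* id is the row-order helper, not a correlation */
quit;

/* Same join as above, with the generated list in place of the typed one. */
proc sql;
  create table missing_values_imputed as
  select a.from
        ,a.to
        ,b.id
        ,a._name_
        ,&coalesceList.
  from all_correlations as a
  inner join correlations_01JAN2008_31DEC2013 as b
    on a._name_ = b._name_
  order by a.from, a.to, b.id;
quit;
```

If a series is constant over the whole sample period as well, its full-sample correlation is also missing, so the PROC MEANS check on NMISS is still worth running after this step.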