Crosstable displaying frequency combinations of N variables in SAS

Asked: 2014-11-25 13:21:59

Tags: sas

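What I have:

  • A table of 20 rows in SAS (originally 100k)
  • Various binary attributes (columns)

What I would like to get:

  • A crosstable showing the frequencies of the attribute combinations

Like this: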

             Attribute1  Attribute2  Attribute3  Attribute4
Attribute1        5           0           1           2
Attribute2        0           3           0           3
Attribute3        2           0           5           4
Attribute4        1           2           0          10

*Actual sums of the combinations; may not be 100% logical

My current code:

    /*create dummy data*/

    data monthly_sales (drop=i);
        do i=1 to 20;
            Attribute1=rand("Normal")>0.5;
            Attribute2=rand("Normal")>0.5;
            Attribute3=rand("Normal")>0.5;
            Attribute4=rand("Normal")>0.5;
            output;
        end;
    run;

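3 answers:

Answer 0: (score: 1)

I suppose this could be done more cleverly, but it seems to work. First, I create a table that should hold all the frequencies: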

data crosstable;
  Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;output;output;output;output;
run;

Then I loop over all combinations and insert the counts into crosstable:

%macro lup();
%do i=1 %to 4;
  %do j=&i %to 4;
    /* Count the rows where both Attribute&i and Attribute&j are 1 */
    proc sql noprint;
      select count(*) into :Antall&i&j
      from monthly_sales (where=(Attribute&i and Attribute&j));
    quit;
    /* Write the count into cells (j,i) and (i,j) of the crosstable */
    data crosstable;
      set crosstable;
      if _n_=&j then Attribute&i=&&Antall&i&j;
      if _n_=&i then Attribute&j=&&Antall&i&j;
    run;
  %end;
%end;
%mend;
%lup;

Note that because the frequency count for (i,j) is the same as for (j,i), you do not need to compute both.
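Because the counts are symmetric, the whole table is also the matrix product X'X of the 0/1 data matrix with itself. The following is only a sketch of that alternative, not the method used in this answer. It assumes SAS/IML is licensed and that the attributes are coded strictly 0/1 with no missing values, as in the dummy data:

```sas
proc iml;
    /* Read the four binary attributes into a 20 x 4 matrix X */
    use monthly_sales;
    read all var {"Attribute1" "Attribute2" "Attribute3" "Attribute4"} into X;
    close monthly_sales;

    /* Labels for the rows and columns of the printed table */
    names = {"Attribute1" "Attribute2" "Attribute3" "Attribute4"};

    /* C[i,j] = number of rows where attribute i and attribute j are both 1;
       the diagonal holds the number of rows where each attribute is 1 */
    C = X` * X;

    print C[rowname=names colname=names];
quit;
```

The diagonal and off-diagonal cells correspond to the counts the macro loop computes one pair at a time.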

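Answer 1: (score: 1)

I would suggest using the built-in SAS tools for this kind of thing, and possibly displaying your data slightly differently, unless you really need the diagonal table. E.g.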

   data monthly_sales (drop=i);
        do i=1 to 20;
            Attribute1=rand("Normal")>0.5;
            Attribute2=rand("Normal")>0.5;
            Attribute3=rand("Normal")>0.5;
            Attribute4=rand("Normal")>0.5;
            count = 1;
            output;
        end;
    run;

proc freq data = monthly_sales noprint;
    table  attribute1 * attribute2 * attribute3 * attribute4 / out = frequency_table;
run;

proc summary nway data = monthly_sales;
    class attribute1 attribute2 attribute3 attribute4;
    var count;
    output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;

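Either of these gives you a table with one row for each combination of attribute values present in the data, which is slightly different from what you requested but conveys the same information. You can force proc summary to include rows for combinations of class variables that do not occur in the data by using the completetypes option in the proc summary statement.

If you do statistical analysis in SAS, it is definitely worth taking the time to get familiar with proc summary: you can include additional output statistics and handle multiple variables with minimal extra code and processing overhead.

Update: although it is a rather convoluted process, the desired table can be produced without resorting to macro logic: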
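As a rough illustration of the completetypes option mentioned above, the sketch below reruns the summary so that all 16 combinations of the four binary attributes appear, provided each attribute takes both values somewhere in the data. The output name summary_all is made up for this example, and _FREQ_ is kept here so that empty combinations are easy to spot:

```sas
proc summary nway completetypes data = monthly_sales;
    class attribute1 attribute2 attribute3 attribute4;
    var count;
    /* summary_all is a hypothetical name; _FREQ_ is 0 for combinations
       that never occur, and their summed count is missing */
    output out = summary_all(drop = _TYPE_) sum(count) = ;
run;
```

Combinations that do occur keep the same summed count as in the version without completetypes.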

proc summary data = monthly_sales completetypes;
    ways 1 2; /*Calculate only 1 and 2-way summaries*/
    class attribute1 attribute2 attribute3 attribute4;
    var count;
    output out = summary_table(drop = _TYPE_ _FREQ_) sum(COUNT)= ;
run;

/*Eliminate unnecessary output rows*/
data summary_table;
    set summary_table;
    array a{*} attribute:;
    sum = sum(of a[*]);
    missing = 0;
    do i = 1 to dim(a);
        missing + missing(a[i]);
        a[i] = a[i] * count;
    end;
    /*We want rows where two attributes are both 1 (sum = 2),
        or one attribute is 1 and the others are all missing*/
    if sum = 2 or (sum = 1 and missing = dim(a) - 1);
    drop i missing sum;
    edge = _n_;
run;

/*Transpose into long format - 1 row per combination of vars*/
proc transpose data = summary_table out = tr_table(where = (not(missing(col1))));
    by edge;
    var attribute:;
run;

/*Use cartesian join to produce table containing desired frequencies (still not in the right shape)*/
option linesize = 150;
proc sql noprint _method _tree;
    create table diagonal as
        select  a._name_ as aname, 
                        b._name_ as bname,
                        a.col1 as count
        from tr_table a, tr_table b
            where a.edge = b.edge
            group by a.edge
            having (count(a.edge) = 4 and aname ne bname) or count(a.edge) = 1
            order by aname, bname
            ;
quit;

/*Transpose the table into the right shape*/
proc transpose data = diagonal out = want(drop = _name_);
    by aname;
    id bname;
    var count;
run;

/*Re-order variables and set missing values to zero*/
data want;
    informat aname attribute1-attribute4;
    set want;
    array a{*} attribute:;
    do i = 1 to dim(a);
        a[i] = sum(a[i],0);
    end;
    drop i;
run;

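Answer 2: (score: 0)

Yes, user667489 is right; I just added some extra code to make the cross-frequency table look good. First, I created a table with 10 million rows and 10 variables: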

data monthly_sales (drop=i);
        do i=1 to 10000000;
            Attribute1=rand("Normal")>0.5;
            Attribute2=rand("Normal")>0.5;
            Attribute3=rand("Normal")>0.5;
            Attribute4=rand("Normal")>0.5;
            Attribute5=rand("Normal")>0.5;
            Attribute6=rand("Normal")>0.5;
            Attribute7=rand("Normal")>0.5;
            Attribute8=rand("Normal")>0.5;
            Attribute9=rand("Normal")>0.5;
            Attribute10=rand("Normal")>0.5;
            output;
        end;
    run;

Create an empty 10x10 crosstable:

data crosstable;
  Attribute1=.;Attribute2=.;Attribute3=.;Attribute4=.;Attribute5=.;Attribute6=.;Attribute7=.;Attribute8=.;Attribute9=.;Attribute10=.;
  output;output;output;output;output;output;output;output;output;output;
run;
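The same empty table could also be built with an array and a loop, which avoids typing every variable and every output statement by hand. This is just a sketch of an equivalent step; the loop variable name row is an arbitrary choice for this example:

```sas
data crosstable;
    /* Declaring the array creates Attribute1-Attribute10 as numeric
       variables, which start out missing */
    array a{10} Attribute1-Attribute10;
    /* Write ten identical all-missing rows */
    do row = 1 to 10;
        output;
    end;
    drop row;
run;
```

Changing the two occurrences of 10 is then enough to resize the table.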

Create the frequency table with proc freq:

proc freq data = monthly_sales noprint;
    table  attribute1 * attribute2 * attribute3 * attribute4 * attribute5 * attribute6 * attribute7 * attribute8 * attribute9 * attribute10
            / out = frequency_table;
run;

Loop over all attribute combinations and sum the "count" variable, then insert the result into crosstable:

%macro lup();
%do i=1 %to 10;
  %do j=&i %to 10;
    /* Sum the pre-aggregated counts where both attributes are 1 */
    proc sql noprint;
      select sum(count) into :Antall&i&j
      from frequency_table (where=(Attribute&i and Attribute&j));
    quit;
    /* Write the total into cells (j,i) and (i,j) of the crosstable */
    data crosstable;
      set crosstable;
      if _n_=&j then Attribute&i=&&Antall&i&j;
      if _n_=&i then Attribute&j=&&Antall&i&j;
    run;
  %end;
%end;
%mend;
%lup;