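Automatic grouping in SAS that minimizes within-group variance

Time: 2015-03-21 06:36:44

Tags: sas cluster-analysis datastep

I am trying to build an automatic grouping. The goal is to choose the grouping with the smallest variance.

In other words, I want to find the x and y below, where x and y are natural numbers,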

GROUP 1: 1977 - x
GROUP 2: x+1 - y
GROUP 3: y+1 - 1994

such that the sum of (the variance of Response in group 1, the variance of Response in group 2, the variance of Response in group 3) is minimized.
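Written out as a formula, the objective might look as follows. This is a sketch that assumes the sample variance with an n - 1 divisor (the default for the VAR statistic in SAS); the symbols S, s^2, r_i and n_G are notation introduced here, not taken from the question.

```latex
\min_{x,\,y \in \mathbb{N},\; 1977 \le x < y < 1994} \; S(x, y)
  = s^2_{[1977,\,x]} + s^2_{[x+1,\,y]} + s^2_{[y+1,\,1994]},
\qquad
s^2_{G} = \frac{1}{n_G - 1} \sum_{i \in G} \left( r_i - \bar{r}_G \right)^2
```

Here G is the set of observations whose Year falls in the given range, r_i is a Response value, and n_G is the number of observations in G. Note that s^2 is undefined when a group holds a single observation, which happens for example when group 1 contains only 1977.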


data maindat;
input  Year Response ;
datalines;
1994    -4.300511714
1994    -9.646920963
1994    -15.86956805
1993    -16.14857235
1993    -13.05797186
1993    -13.80941206
1992    -3.521394503
1992    -1.102526302
1992    -0.137573583
1992    2.669238665
1992    -9.540489193
1992    -19.27474303
1992    -3.527077011
1991    1.676464068
1991    -2.238822314
1991    4.663079037
1991    -5.346920963
1990    -8.543723186
1990    0.507460641
1990    0.995302284
1990    0.464194011
1989    4.728791571
1989    5.578685423
1988    2.771297564
1988    7.109159247
1987    15.96059456
1987    2.985292226
1986    -4.301136971
1985    5.854674875
1985    5.797294021
1984    4.393329025
1983    -6.622580905
1982    0.268500302
1977    12.23062252
;
run;

My idea is to use 2 nested loops:

1st do loop (1st iteration): Group 1    1977 - 1977    1977 - 1977   1977 - 1977    …   1977 - 1977
2nd do loop:                 Group 2    1978 - 1978    1978 - 1979   1978 - 1980    …   1978 - 1993
Else:                        Group 3    1979 - 1994    1980 - 1994   1981 - 1994    …   1994 - 1994
1st do loop (2nd iteration): Group 1    1977 - 1978    1977 - 1978   1977 - 1978    …   1977 - 1978
2nd do loop:                 Group 2    1979 - 1979    1979 - 1980   1979 - 1981    …   1979 - 1993
Else                         Group 3    1980 - 1994    1981 - 1994   1982 - 1994    …   1994 - 1994
...
1st do loop (n-1th iteration) Group 1   1977 - 1991   1977 - 1991           
2nd do loop:                  Group 2   1992 - 1992   1992 - 1993           
Else                          Group 3   1993 - 1994   1994 - 1994           
1st do loop (nth iteration)   Group 1   1977 - 1992             
2nd do loop:                  Group 2   1993 - 1993             
Else                          Group 3   1994 - 1994             

Then I simply pick the grouping that gives the smallest sum of the within-group variances of Response across the 3 groups.
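The two nested loops described above might be sketched as a DATA step that only enumerates the candidate cut points. The dataset name `candidates` is an assumption made for illustration, and the hard-coded 1977 and 1994 are simply the first and last years in maindat.

```sas
/* Sketch: list every (x, y) pair from the table above.        */
/* Group 1 = 1977 to x, group 2 = x+1 to y, group 3 = y+1 to 1994. */
data candidates;
  do x = 1977 to 1992;        /* last year of group 1 */
    do y = x + 1 to 1993;     /* last year of group 2 */
      output;
    end;
  end;
run;
```

This only builds the search space (16 + 15 + ... + 1 = 136 pairs); the variance of each group still has to be computed for every pair.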

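2 Answers:

Answer 0: (score: 1)

This is a manual, exhaustive approach. It should solve your problem as stated, but it is not a good way to go if you want more groups or have larger data.

I'm sure there is a more sensible approach using one of the procs, but nothing springs immediately to mind.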

/* Get the year bounds */
proc sql noprint;
    select min(year), max(year)
    into :yMin, :yMax
    from maindat;
quit;

/* Get all the boundaries: each row is a candidate interval, min inclusive and max exclusive */
data cutoffs;
    do min = &yMin. to &yMax.;
        do max = min + 1 to &yMax. + 1;
            output;
        end;
    end;
run;
proc sql;
    /* Calculate all the variances */
    create table vars as
    select 
        a.*,
        var(b.Response) as var
    from cutoffs as a
    left join maindat as b
        on a.min <= b.year < a.max
    group by a.min, a.max;

    /* Get the sum of the variances for each set of 3 groups */
    create table want as
    select 
        a.min as a,
        b.min as b,
        c.min as c,
        c.max as d,
        sum(a.var, b.var, c.var) as sumVar
    from vars as a
    left join vars as b
        on a.max = b.min
    left join vars as c
        on b.max = c.min
    /* max is an exclusive bound, so group 3 must end at &yMax. + 1 to include the last year */
    where a.min = &yMin. and c.max = &yMax. + 1 and a.var and b.var and c.var
    order by a.min, b.min, c.min;

    /* Output your answer (combine with previous step if you don't want the list) */
    select * 
    from want
    where sumVar in (select min(sumVar) from want);
quit;
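To sanity-check any single row of want by hand, something like the following could be used. In want, the columns a, b and c are the first years of groups 1, 2 and 3, so x = b - 1 and y = c - 1. The cut points below are placeholders chosen only for illustration, not results, and the names `check` and `grp` are assumptions.

```sas
/* Sketch: recompute the group variances for one chosen pair of cut points. */
%let x = 1985;   /* placeholder: last year of group 1 */
%let y = 1991;   /* placeholder: last year of group 2 */

data check;
  set maindat;
  if year <= &x. then grp = 1;
  else if year <= &y. then grp = 2;
  else grp = 3;
run;

proc means data=check n var;   /* count and sample variance per group */
  class grp;
  var Response;
run;
```

Adding up the three reported variances gives the quantity that sumVar is meant to hold for that split.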

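Answer 1: (score: 0)

SRSwift's answer is probably the best one for the problem as you posed it. The difficulty with using a standard algorithm is that the function you are minimizing (the variance of Response) does not seem to have a single minimum that is both local and global. Instead there are several local minima, and a simple algorithm has relatively little flexibility to adjust for the density of the data, so it does not work well. This kind of thing is easy to get around when you have many years: you can jump around a bit, skipping five or ten years at a time, to avoid the local minima. With only a couple of decades that is not practical.

This is a core machine learning application, clustering nodes, and it has many solutions. Yours in particular seems to lend itself to the simplest one I learned in a course a few years ago, which is easy to implement once you think it through.

  1. Define the function to minimize, say minim_f.
  2. Define a function that takes the data and modifies the cluster centroids (or whatever defines the clusters) by moving one centroid in one direction, say modif_f. (The centroid and the direction should be parameters.)
  3. Then call minim_f and modif_f alternately: call minim_f and keep its value, call modif_f with one set of parameters, then call minim_f again and see whether the value is better. If it is, keep going in that direction. If not, revert to the values from the previous iteration and try a different modification in modif_f. Continue until you find a local minimum, which hopefully is also the global minimum.

    The exact mechanics vary; in particular, you can adjust one centroid at a time or several, and you have to work out the right way to keep adjusting until no further adjustment helps.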
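The alternation in step 3 can be illustrated on a toy problem before applying it to clusters. The sketch below assumes a made-up objective, (x - 3)**2 over the integers, standing in for minim_f, and a move of one unit up or down standing in for modif_f; none of these names or values come from the data in the question.

```sas
/* Toy sketch of the evaluate / modify / compare loop (not the clustering itself). */
data _null_;
  x = 0;                          /* starting guess */
  best = (x - 3)**2;              /* "minim_f" evaluated at the start */
  improved = 1;
  do while (improved);
    improved = 0;
    do dir = 1, -1;               /* "modif_f": try one step up, then one step down */
      cand = x + dir;
      val = (cand - 3)**2;        /* re-evaluate the objective for the candidate */
      if (val < best) and (improved = 0) then do;
        x = cand;                 /* keep the better value and continue from it */
        best = val;
        improved = 1;
      end;
    end;
  end;                            /* stops when neither direction improves */
  put x= best=;
run;
```

Because this toy objective has a single minimum, the loop cannot get stuck; with several local minima, as discussed above, the stopping point depends on the starting guess.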

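    I wrote a small example for your data. It does get the same answer as SRSwift's, although the variance computed by proc means differs from the variance in SRSwift's program. I'm not a statistician and won't say which is correct, but they evidently behave similarly enough that it doesn't matter. Mine is a very simple implementation and would benefit greatly from improvement, but hopefully it explains the basic concept.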

    data maindat;
    input  Year Response ;
    datalines;
    1994    -4.300511714
    1994    -9.646920963
    1994    -15.86956805
    1993    -16.14857235
    1993    -13.05797186
    1993    -13.80941206
    1992    -3.521394503
    1992    -1.102526302
    1992    -0.137573583
    1992    2.669238665
    1992    -9.540489193
    1992    -19.27474303
    1992    -3.527077011
    1991    1.676464068
    1991    -2.238822314
    1991    4.663079037
    1991    -5.346920963
    1990    -8.543723186
    1990    0.507460641
    1990    0.995302284
    1990    0.464194011
    1989    4.728791571
    1989    5.578685423
    1988    2.771297564
    1988    7.109159247
    1987    15.96059456
    1987    2.985292226
    1986    -4.301136971
    1985    5.854674875
    1985    5.797294021
    1984    4.393329025
    1983    -6.622580905
    1982    0.268500302
    1977    12.23062252
    ;
    run;
    
    proc sort data=maindat;
      by year;
    run;
    
    proc freq data=maindat;            * Start us off with a frequency table by year.;
    tables year/out=yearfreq outcum;
    run;
    
    data initial_clusters;             * Guess that the best starting point is 1/3 of the years for each cluster.;
      set yearfreq;
      cluster = floor(cum_pct/33.334)+1;
    run;
    
    
    
    data cluster_years;                * Merge on the clusters;
      merge maindat initial_clusters(keep=year cluster);
      by year;
    run;
    
    proc means data=cluster_years;     * And get that starting variance.;
      class cluster;
      types cluster;
      var response;
      output out=cluster_var var=;
    run;
    
    data cluster_var_tot;              * Create our starting 'cumulative' file of variances;
      set cluster_var end=eof;
      total_var+response;
      iter=1;
      if eof then output;
      keep total_var iter;
    run;
    
    
    data current_clusters;             * And initialize the current cluster estimate to the initial clusters;
      set initial_clusters;
    run;
    
    
                                       * Here is our recursive cluster-testing macro.;
    %macro try_cluster(cluster_adj=, cluster_new=,iter=1);
    /* Here I include both MODIF_F and MINIM_F, largely because variable scoping is irritating if I separate them. */
    /* But you can easily swap out the MINIM_F portion if needed to a different minimization function. */
    
    /* This is MODIF_F, basically */
    data adjusted_clusters;
      set current_clusters;
      by cluster;
      %if &cluster_adj. < &cluster_new. %then %do;
        if last.cluster 
      %end;
      %else %do;
        if first.cluster
      %end;
        and cluster=&cluster_adj. then cluster=&cluster_new.;
    run;
    
    data cluster_years;
      merge maindat adjusted_clusters(keep=year cluster);
      by year;
    run;
    /* end MODIF_F */
    
    /* This would be MINIM_F if it were a function of its own */
    proc means data=cluster_years noprint;   *Calculate variance by cluster;
      class cluster;
      types cluster;
      var response;
      output out=cluster_var var=;
    run;
    
    
    data cluster_var_tot;                    
      set cluster_var_tot cluster_var indsname=dsn end=eof;
      retain last_var last_iter;
      if dsn='WORK.CLUSTER_VAR_TOT' then do;  *Keep the old cluster variances for history;
        output;
        last_var=total_var;
        last_iter=_n_;
      end;
      else do;                                *Sum up the variance for this iteration;
         total_var+response;
         iter=last_iter+1;
         if eof then do;
           if last_var > total_var then smaller=1;   *If it is smaller...;
           else smaller=0;                    
           call symputx('smaller',smaller,'l');      *save smaller to a macro variable;
           if smaller=1 then output;                 *... then output it.;
         end;
      end;
      keep total_var iter;
    run;
    
    /* End MINIM_F */
    
    
    %if &smaller=1 %then %do;                        *If this iteration was better, then keep iterating, otherwise stop;
      data current_clusters;  
        set adjusted_clusters;                       *replace old clusters with better clusters; 
      run;
      %if &iter<10 %then %try_cluster(cluster_adj=&cluster_adj.,cluster_new=&cluster_new.,iter=&iter.+1);
    %end;
    
    %mend try_cluster;
    
    * Let us try a few changes;
    %try_cluster(cluster_adj=1,cluster_new=2,iter=1);
    %try_cluster(cluster_adj=2,cluster_new=1,iter=1);
    %try_cluster(cluster_adj=3,cluster_new=2,iter=1);
    * That was just an example (that happens to work for this data); 
    * This part would be greatly enhanced by some iteration testing and/or data-appropriate modifications;
    
    
    * Now merge back on the 'current' clusters, since the current cluster_years is actually one worse;
    data cluster_years;
      merge maindat current_clusters(keep=year cluster);
      by year;
    run;
    
    * And get the variance just as a verification.;
    proc means data=cluster_years;
      class cluster;
      types cluster;
      var response;
      output out=cluster_var var=;
    run;