PROC HPBIN: minimum proportion per bin

Time: 2019-04-15 13:08:53

Tags: sas nested-loops binning sas-studio

I am using PROC HPBIN to split my data into equal-width buckets, i.e. each bucket covers an equal share of the variable's total range.

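My problem arises when the data are heavily skewed over a large range. Almost all of my data points land in a single bucket, with only a few observations scattered near the extremes.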

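Is there a way to force PROC HPBIN to take the proportion of values in each bin into account, so that every bin holds at least, say, 5% of the observations and sparser bins are grouped together?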

DATA var1;
    DO VAR1 = 1 TO 100;
        OUTPUT;
    END;
    DO VAR1 = 500 TO 505;
        OUTPUT;
    END;
    DO VAR1 = 7000 TO 7015;
        OUTPUT;
    END;
    DO VAR1 = 1000000 TO 1000010;
        OUTPUT;
    END;
RUN;

/*Use proc hpbin to generate bins of equal width*/
ODS EXCLUDE ALL;
ODS OUTPUT
    Mapping = bin_width_results;
PROC HPBIN
    DATA=var1
    numbin = 15
    bucket;
    input VAR1 / numbin = 15;
RUN;
ODS EXCLUDE NONE;

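I'd like to see a way, using PROC HPBIN or some other method, to merge the empty bins so that each bucket holds at least 5% of the observations. However, I don't want to use percentiles here (that is a different plot in my PDF), because I want to see the spread.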

2 answers:

Answer 0: (score: 1)

The QUANTILE option with 20 bins should give you about 5% of the observations per bin:

PROC HPBIN DATA=var1 quantile;
    input VAR1 / numbin = 20;
RUN;

When a bin holds too high a proportion of the data (a problem bin) and its values need to be re-binned dynamically, you only need to run HPBIN on the values in that problem bin. A macro can be written to loop around the HPBIN procedure, zooming in on the problem regions.

For example:

DATA have;
    DO VAR1 = 1 TO 100;
        OUTPUT;
    END;
    DO VAR1 = 500 TO 505;
        OUTPUT;
    END;
    DO VAR1 = 7000 TO 7015;
        OUTPUT;
    END;
    DO VAR1 = 1000000 TO 1000010;
        OUTPUT;
    END;
RUN;

%macro bin_zoomer (data=, var=, nbins=, rezoom=0.25, zoomlimit=8, out=);
    %local data_view step nextstep outbins zoomers;

    proc sql;
        create view data_zoom1 as
        select 1 as step, &var from &data;
    quit;

    %let step = 1;
    %let data_view = data_zoom&step;
    %let outbins = bins_step&step;

    %bin:

    %if &step > &zoomlimit %then %goto done;

    ODS EXCLUDE ALL;
    ODS OUTPUT Mapping = &outbins;
    PROC HPBIN DATA=&data_view bucket ;
        id step;
        input &var / numbin = &nbins;
    RUN;
    ODS EXCLUDE NONE;

    proc sql noprint;
        select count(*) into :zoomers trimmed
        from &outbins
        where proportion >= &rezoom
        ;

    %put NOTE: &=zoomers;

    %if &zoomers = 0 %then %goto done;

    %let step = %eval(&step+1);

    proc sql;
        create view data_zoom&step as
        select &step as step, *
        from &data_view data
        join &outbins bins
        on data.&var between bins.LB and bins.UB
        and bins.proportion >= &rezoom
        ;
    quit;

    %let outbins = bins_step&step;
    %let data_view = data_zoom&step;

    %goto bin;

    %done:

    %put NOTE: done @ &=step;

    * stack the bins that are non-problem or of final zoom;
    * the LB to UB domains from step2+ will discretely cover the bounds
    * of the original step1 bins;

    data &out;
        set bins_step1-bins_step&step indsname = source ;
        if proportion < &rezoom or source = "bins_step&step";
        step = source;
    run;
%mend;

options mprint;

%bin_zoomer(data=have, var=var1, nbins=15, out=bins);

Answer 1: (score: 1)

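Have you tried the WINSOR method (Winsorized binning)? From the documentation:

  "Winsorized binning is similar to bucket binning, except that both tails are cut off to obtain a smooth binning result. This technique is often used to remove outliers during the data preparation stage."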

You can specify WINSORRATE to control how much of each tail is trimmed.
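
As a minimal sketch of what that might look like on the question's var1 data set: the rate of 0.1 and the output data set name bin_winsor_results are illustrative assumptions, not values from the original answer, and the rate would need tuning against the real data. In the sample data roughly 8% of the rows sit near 1,000,000, so a noticeably smaller rate may leave that tail inside the binned range.

```sas
/* Sketch only: Winsorized binning on the question's sample data.      */
/* WINSORRATE=0.1 is an assumed, illustrative value; check the PROC    */
/* HPBIN documentation for the permitted range and exact tail handling. */
ODS EXCLUDE ALL;
ODS OUTPUT
    Mapping = bin_winsor_results;   /* hypothetical output data set name */
PROC HPBIN
    DATA=var1
    WINSOR WINSORRATE=0.1;
    input VAR1 / numbin = 15;
RUN;
ODS EXCLUDE NONE;

/* Inspect the resulting bin boundaries and proportions */
PROC PRINT DATA=bin_winsor_results;
RUN;
```

Comparing the proportions in bin_winsor_results with those in bin_width_results from the question should show whether trimming the tails spreads the observations across more of the 15 bins.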