Question

My data are longitudinal.

VISIT ID   VAR1
1     001  ...
1     002  ...
1     003  ...
1     004  ...
...
2     001  ...
2     002  ...
2     003  ...
2     004  ...

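Our final goal is to select 10% of the records from each visit for testing. I tried using PROC SURVEYSELECT to do simple random sampling (SRS) without replacement, with "VISIT" as the stratum. But the final sample contains duplicate IDs. For example, ID = 001 can be selected in both VISIT = 1 and VISIT = 2.

Is there a way to do this with SURVEYSELECT or another procedure (R is fine too)? Thanks a lot.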
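
For reference, the stratified call described above might look roughly like the sketch below. The data set name `have`, the output name `sample_by_visit` and the seed value are assumptions made for illustration; the exact call used was not posted. Sampling without replacement here applies only within each stratum, which is why the same ID can still be drawn in more than one visit.

```sas
/* Sketch only: `have`, `sample_by_visit` and the seed are hypothetical */
/* The STRATA statement expects the input sorted by the strata variable */
proc sort data=have;
  by visit;
run;

/* 10% simple random sample without replacement within each visit */
proc surveyselect data=have out=sample_by_visit
                  method=srs samprate=0.1 seed=12345;
  strata visit;
run;
```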

Answer 1

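This is possible with some fairly creative data step programming. The code below uses a greedy approach: it samples each visit in turn, drawing only from IDs that have not been sampled in an earlier visit. Because the 10% rate is applied to the IDs still available in a visit (`available_per_visit` in the code), a visit in which some IDs have already been sampled yields fewer than 10% of its IDs. In the extreme case, when every ID in a visit has already been sampled, no rows are output for that visit.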

/*Create some test data*/
data test_data;
  call streaminit(1);
  do visit = 1 to 1000;
    do id = 1 to ceil(rand('uniform')*1000);
      output;
    end;
  end;
run;


data sample;
  /*Create a hash object to keep track of unique IDs not sampled yet*/
  if 0 then set test_data;
  call streaminit(0);
  if _n_ = 1 then do;
    declare hash h();
    rc = h.definekey('id');
    rc = h.definedata('available');
    rc = h.definedone();
  end;
  /*Find out how many not-previously-sampled ids there are for the current visit*/
  do ids_per_visit = 1 by 1 until(last.visit);
    set test_data;
    by visit;
    if h.find() ne 0 then do;
      available = 1;
      rc = h.add();
    end;
    available_per_visit = sum(available_per_visit,available);
  end;
  /*Read through the current visit again, randomly sampling from the not-yet-sampled ids*/
  samprate = 0.1;
  number_to_sample = round(available_per_visit * samprate,1);
  do _n_ = 1 to ids_per_visit;
    set test_data;
    if available_per_visit > 0 then do;
      rc = h.find();
      if available = 1 then do;
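        /*Selection sampling: take this id with probability (still needed)/(still available),*/
        /*which yields exactly number_to_sample ids from the current visit*/
        if rand('uniform') < number_to_sample / available_per_visit then do;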
          available = 0;
          rc = h.replace();
          samples_per_visit = sum(samples_per_visit,1);
          output;
          number_to_sample = number_to_sample - 1;
        end;
        available_per_visit = available_per_visit - 1;
      end;
    end;
  end;
run;

/*Check that there are no duplicate IDs: sample_dedup should have as many rows as sample*/
proc sort data = sample out = sample_dedup nodupkey;
by id;
run;
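
As an additional check, the sampled rows can be compared with the source data per visit. This is a minimal sketch that assumes the `test_data` and `sample` data sets created above; the table name `visit_rates` and the column aliases are made up for illustration.

```sas
proc sql;
  /* The two counts should match if no ID was sampled more than once */
  select count(*)           as n_sampled_rows,
         count(distinct id) as n_unique_ids
  from sample;

  /* IDs present and IDs sampled per visit; later visits can fall below 10% */
  create table visit_rates as
  select t.visit,
         count(distinct t.id) as n_ids,
         count(distinct s.id) as n_sampled,
         count(distinct s.id) / count(distinct t.id) as rate
  from test_data as t
       left join sample as s
       on t.visit = s.visit and t.id = s.id
  group by t.visit;
quit;
```

Looking at `rate` across visits shows how quickly the greedy approach uses up the pool of unsampled IDs.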

Random sampling without replacement in longitudinal data
