Question

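During some data cleaning, I need to compare data across different rows. For example, when several rows share the same CountryID and SubjectID, I want to keep only the one with the highest temperature: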

CountryID      SubjectID     Temperature
1001           501           36
1001           501           38
1001           510           37
1013           501           36
1013           501           39
1095           532           36

In this situation, I would use the lag() function as follows.

proc sort data=table;
    by CountryID SubjectID descending Temperature;
run;
data table_laged;
    set table;
    CountryID_lag = lag(CountryID);
    SubjectID_lag = lag(SubjectID);
    Temperature_lag = lag(Temperature);
    if CountryID = CountryID_lag and SubjectID = SubjectID_lag then do;
        if Temperature < Temperature_lag then delete;
    end;
    drop CountryID_lag SubjectID_lag Temperature_lag;
run;

The code above seems to work.

But I still wonder whether there is a better way to solve this kind of problem?

Answer 1

I think you are overcomplicating the task. You can use proc sql with the max function:

proc sql noprint;
   create table table_laged as
   select CountryID,SubjectID,max(Temperature) as Temperature
   from table
   group by CountryID,SubjectID;
quit;

Answer 2

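I don't know whether this is what you intended, but your code only keeps a single highest temperature when there are no ties. When a subject has 2 1 3, it keeps 3. But when a subject has 1 4 3 4 4, it keeps 4 4 4, because a tied row is never strictly lower than the row before it. It is simpler and safer to keep just the first row of each subject, which is the highest because of the descending sort order: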

proc sort data = table;
    by CountryID SubjectID descending Temperature;
run;
data table_laged;
    set table;
    by CountryID SubjectID;
    if first.SubjectID;
run;
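To check the tie behavior described above, here is a minimal sketch. The dataset names `ties`, `lag_based`, and `first_only` are made up for illustration, and the values are the 1 4 3 4 4 example for a single subject.

```sas
/* Hypothetical test data: one subject with tied maximum values */
data ties;
  input CountryID SubjectID Temperature;
  datalines;
1001 501 1
1001 501 4
1001 501 3
1001 501 4
1001 501 4
;
run;

proc sort data=ties;
  by CountryID SubjectID descending Temperature;
run;

/* lag() comparison: a tied row is not strictly lower, so it survives */
data lag_based;
  set ties;
  Temperature_lag = lag(Temperature);
  if Temperature < Temperature_lag then delete;
  drop Temperature_lag;
run;                      /* keeps three rows: 4, 4, 4 */

/* first. logic: exactly one row per CountryID/SubjectID group */
data first_only;
  set ties;
  by CountryID SubjectID;
  if first.SubjectID;
run;                      /* keeps one row: 4 */
```

If you actually want to keep every tied row, the lag() version is fine; if you want one row per group, first.SubjectID is the one to use.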

Answer 3

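You can use the double DOW technique:

1. Compute a measure for the group,
2. Apply that measure to the items in the group.

The benefit of DOW looping is that everything happens in a single DATA step, without a separate summary-and-merge step, provided the incoming data is already grouped.

In this problem, loop 1 identifies the row with the highest temperature in the group, and loop 2 selects that row for output.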

data want;
  do _n_ = 1 by 1 until (last.SubjectId);
    set have;
    by CountryId SubjectId;
    if temperature > _max_temp then do;
      _max_temp = temperature;
      _max_at_n = _n_;
    end;
  end;
  do _n_ = 1 to _n_;
    set have;
    if _n_ = _max_at_n then OUTPUT;
  end; 
  drop _:;       
run;

A traditional procedural technique is Proc MEANS:

data have;
  input CountryID SubjectID Temperature;
  datalines;
1001           501           36
1001           501           38
1001           510           37
1013           501           36
1013           501           39
1095           532           36
;
run;

proc means noprint data=have;
  by countryid subjectid;
  output out=want(drop=_:) max(temperature)=temperature;
run;

If the data coming into the data step is not ordered by CountryID and SubjectID, you can either use a hash object, or use SQL as @Aurieli does.
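For the hash object route, a minimal sketch might look like the following. It assumes the input is the `have` dataset from above; the output name `want_hash` and the variable `MaxTemperature` are names I made up for illustration. The hash is keyed by the group and holds the running maximum, so the input does not need to be sorted.

```sas
data _null_;
  length MaxTemperature 8;

  /* One hash entry per CountryID/SubjectID pair; ordered for tidy output */
  declare hash h(ordered: 'a');
  h.defineKey('CountryID', 'SubjectID');
  h.defineData('CountryID', 'SubjectID', 'MaxTemperature');
  h.defineDone();

  do until (done);
    set have end=done;
    /* find() returns 0 and loads MaxTemperature when the group exists */
    if h.find() ne 0 then MaxTemperature = Temperature;
    else MaxTemperature = max(MaxTemperature, Temperature);
    h.replace();          /* add the group or update its maximum */
  end;

  /* Write one row per group: CountryID, SubjectID, MaxTemperature */
  h.output(dataset: 'want_hash');
  stop;
run;
```

This only carries the key and the maximum. If you need other columns from the winning row, add them to defineData and update them only when a new maximum is found.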
