Question

Suppose I have a data set with five observations and two columns.

A       B
Orange  Banana
Plum    Apple
Banana  Orange
Plum    Grape
Grape   Apple

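I want to remove duplicate rows where A || B equals B || A, i.e. remove the observation with A=Banana and B=Orange, because A=Orange and B=Banana was already observed earlier in the data set.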

Answer 1

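You can sort the values within each row, so that the observation with A=Banana and B=Orange and the observation with A=Orange and B=Banana both become A=Banana and B=Orange.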

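The following uses CALL SORTC to sort them. On the assumption that you do not want to lose the original variables, it uses a view to create new copies of the variables and sorts those. Once they are sorted, you can de-duplicate in any way you like.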

data have ;
  input a $8. b $8. ;
  cards ;
Orange  Banana
Plum    Apple
Banana  Orange
Plum    Grape
Grape   Apple
;


data myview/view=myview ;
  set have ;
  mya=a ;
  myb=b ;
  call sortc(mya,myb) ;
run ;


proc sort nodupkey data=myview out=want(drop=mya myb) ;
  by mya myb ;
run ;
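Note that the PROC SORT step above returns the surviving rows in mya/myb order rather than in their original order. If the original order matters, one possible variation (a sketch only; the names myview2, dedup, want_ordered and rownum are made up for illustration) is to carry a row counter through the view and sort back on it afterwards. EQUALS is already the PROC SORT default, but it is spelled out here so that NODUPKEY keeps the first row encountered for each pair.

```sas
/* Sketch: same idea as above, but remember each row's original position */
data myview2 / view=myview2 ;
  set have ;
  rownum = _n_ ;           /* hypothetical helper: original row number */
  mya = a ;
  myb = b ;
  call sortc(mya, myb) ;   /* order the two copies within the row */
run ;

/* Keep the first row seen for each unordered pair */
proc sort nodupkey equals data=myview2 out=dedup ;
  by mya myb ;
run ;

/* Restore the original order and drop the helper variables */
proc sort data=dedup out=want_ordered(drop=mya myb rownum) ;
  by rownum ;
run ;
```

With the sample data, this should drop only the Banana/Orange row and leave the other four rows in their original order.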

Answer 2

Just order the values within each row so that A is less than B.

That is easy when there are only two variables.

proc sql ;
create table want as
  select distinct
    case when (a<b) then a else b end as A
   ,case when (a<b) then b else a end as B
  from have
;
quit;
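Because SELECT DISTINCT on the reordered values returns the pairs in their normalized form (smaller value in A), it can help to see beforehand which pairs actually occur in both orientations. A minimal sketch of such a check, assuming the same have data set, is a self-join; the aliases x and y are arbitrary.

```sas
/* Sketch: list each pair that appears as both (a,b) and (b,a) */
proc sql ;
  select x.a, x.b
  from have as x, have as y
  where x.a = y.b
    and x.b = y.a
    and x.a < x.b      /* report each mirrored pair once */
  ;
quit ;
```

For the sample data this should report the Banana/Orange pair only.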

Answer 3

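Consider the more general case of n fields that make up a composite key, where the sorted order of the key values is what determines a duplicate.

A hash of the sorted, delimiter-concatenated key fields can be used to check whether a key has been seen before.

In this example the key field values are copied into a parallel array so that sortc can sort them without disturbing the original data. The first occurrence of each key is output.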

data have;
  call streaminit(123);

  do row = 1 to 1e5;
    array numfields numfield1-numfield5;
    do over numfields;
      numfields = floor(5 * rand('uniform'));
    end;
    length charfield1-charfield5 $8;
    array charfields charfield1-charfield5;
    do over charfields;
      charfields = byte(65 + floor(5 * rand('uniform')));
    end;
    output;
  end;
run;

data want;
  set have;
  array keys(10) $200 _temporary_ ;

  array nums numfield1-numfield5;
  array chars charfield1-charfield5;

  _index = 0;
  do _numindex = 1 to dim(nums);
    _index + 1;
    keys(_index) = put(nums(_numindex),RB8.);
  end;

  do _charindex = 1 to dim(chars);
    _index + 1;
    keys(_index) = chars(_charindex);
  end;

  call sortc (of keys(*));

  _sorted_composite_key = catx('ff'x, of keys(*));

  if _n_ = 1 then do;
    declare hash sortedKeys ();
    sortedKeys.defineKey('_sorted_composite_key');
    sortedKeys.defineDone();
  end;

  if sortedKeys.check() ne 0 then do;
    output;
    sortedKeys.add();
  end;

  drop _:;
run;
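The same technique can be tried on the two-column data set from the question. The following is only a sketch under that assumption; the names want_hash, _lo, _hi, _key and seen are illustrative and not part of the answer above. Unlike the PROC SORT and SQL approaches, it keeps the rows in their original order and with their original values.

```sas
/* Sketch: hash-based de-duplication of unordered (a,b) pairs */
data want_hash ;
  set have ;
  length _lo _hi $8 _key $17 ;

  /* Work on copies so a and b are left untouched */
  _lo = a ;
  _hi = b ;
  call sortc(_lo, _hi) ;

  /* Delimited concatenation of the sorted values is the lookup key */
  _key = catx('ff'x, _lo, _hi) ;

  if _n_ = 1 then do ;
    declare hash seen() ;
    seen.defineKey('_key') ;
    seen.defineDone() ;
  end ;

  /* check() returns non-zero when the key has not been stored yet */
  if seen.check() ne 0 then do ;
    output ;
    seen.add() ;
  end ;

  drop _: ;
run ;
```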

Remove duplicates where A || B equals B || A
