Question

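I have observations with columns ID, a, b, c and d. I want to count the number of unique values across columns a, b, c and d. So:

I want:

I can't figure out how to count distinct values within each row. I can do a distinct count across rows, but I don't know how to do it across the columns within a row. Any help would be appreciated. Thanks.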

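UPDATE: Thank you to everyone who replied!!

I used a different (less efficient) approach that I felt I understood better. I will still study the methods listed below in order to learn the proper way. Here is what I did, in case anyone is curious: I created four tables, and in each table I created a variable named 'abcd' and placed one of the four variables under that name.

So it looks like this: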

PROC SQL;
CREATE TABLE table1_a AS
    SELECT 
        *,
        a as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

PROC SQL;
CREATE TABLE table2_b AS
    SELECT 
        *,
        b as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

PROC SQL;
CREATE TABLE table3_c AS
    SELECT 
        *,
        c as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

PROC SQL;
CREATE TABLE table4_d AS
    SELECT 
        *,
        d as abcd
    FROM table_I_have_with_all_columns
;
QUIT;

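Then I stacked them (this means I have duplicated rows, but that is fine, because I only wanted all the variables in one column so that I could do a distinct count).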

data ALL_STACK;
    set 
    table1_a
    table2_b
    table3_c
    table4_d
;
run;

Then I counted all the unique values in 'abcd', grouped by ID:

PROC SQL ;
    CREATE TABLE count_unique AS
    SELECT
    My_id,
    COUNT(DISTINCT abcd) as Count_customers
    FROM ALL_STACK
    GROUP BY my_id
    ;
QUIT;

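Obviously, copying a table four times just to put the variables under the same name and then stacking them is not efficient. But my table is fairly small, so I could do it and then delete the copies right after stacking. If you have a very large dataset, this approach would certainly be cumbersome. I used this approach rather than the others because I was trying to use procs rather than loops and the like.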
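The same stack-then-count idea can probably be done in one step, without the four intermediate copies. The following is only a sketch: it reuses the table and column names from the update above (table_I_have_with_all_columns, My_id, abcd) and assumes My_id identifies a single row, so that grouping by it amounts to counting within a row.

PROC SQL;
    CREATE TABLE count_unique AS
    SELECT
        My_id,
        /* COUNT(DISTINCT ...) ignores missing values */
        COUNT(DISTINCT abcd) AS Count_customers
    FROM (
        /* stack the four columns under one name, keeping only the key */
        SELECT My_id, a AS abcd FROM table_I_have_with_all_columns
        UNION ALL
        SELECT My_id, b AS abcd FROM table_I_have_with_all_columns
        UNION ALL
        SELECT My_id, c AS abcd FROM table_I_have_with_all_columns
        UNION ALL
        SELECT My_id, d AS abcd FROM table_I_have_with_all_columns
    )
    GROUP BY My_id
    ;
QUIT;

If My_id repeats across rows, the result is a distinct count per ID rather than per row, exactly as with the stacked-table version.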

Answer 1

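A linear search for duplicates in an array is O(n²), which is perfectly fine for small n. For a b c d, n is 4.

The search evaluates every pair in the array, and its flow is very similar to a bubble sort.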

data have;
  input id a b c d; datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
55 . . . .
66 1 2 3 4
run;

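A linear search for duplicates is performed on every row, and count_distinct is automatically initialized to a missing (.) value on each row. When a non-missing value is not found at any earlier array index, the sum function is used to increment the count.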

* linear search O(N**2);

data want;
  set have;
  array x a b c d;

  do i = 1 to dim(x) while (missing(x(i)));
  end;
  if i <= dim(x) then count_distinct = 1;
  do j = i+1 to dim(x);
    if missing(x(j)) then continue;
    do k = i to j-1 ;
      if x(k) = x(j) then leave;
    end;
    if k = j then count_distinct = sum(count_distinct,1);
  end;
  drop i j k;
run;

Answer 2

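Try transposing the dataset so that each ID becomes a column, run a frequency count on each ID column with the nlevels option to get the number of distinct levels, then merge the result back with the original dataset.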

Proc transpose data=have prefix=ID out=temp;
id ID;
run;

Proc freq data=temp nlevels;
table ID:;
ods output nlevels=count(keep=TableVar NNonMisslevels);
run;

data count;
   set count;
   ID=input(compress(TableVar,,'kd'),best12.);  /* numeric, so it matches id in have */
   drop TableVar;
run;

data want;
    merge have count;
    by id;
run;
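A related variation is to transpose to a long layout (one row per ID and variable) instead of one column per ID, and then count distinct values per ID. This is a sketch rather than part of the answer above; it assumes have is sorted by id and that id is unique per row, and the names long, value, want_long and count_distinct are made up for illustration.

proc transpose data=have out=long(rename=(col1=value));
  by id;            /* requires have to be sorted by id */
  var a b c d;      /* each becomes one row per id */
run;

proc sql;
  create table want_long as
  select id,
         count(distinct value) as count_distinct  /* missing values are not counted */
  from long
  group by id;
quit;

Unlike the linear-search answer, a row whose values are all missing gets a count of 0 here rather than a missing count.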

Answer 3

Another approach, using sortn and conditions.

data have;
  input id a b c d; datalines;
  11 2 3 4 4
  22 1 8 1 1
  33 6 . 1 2
  44 . 1 1 .
  55 . . . .
  66 1 2 3 4
  77 . 3 . 4
  88 . 9 5 .
  99 . . 2 2
  76 . . . 2
  58 1 1 . .
  50 2 . 2 .
  66 2 . 7 .
  89 1 1 1 .
  75 1 2 3 .
  76 . 5 6 7
  88 . 1 1 1
  43 1 . . 1
  31 1 . . 2
  ;

data want;
set have;
_a=a; _b=b; _c=c; _d=d; 
array hello(*) _a _b _c _d;
call sortn(of hello(*));
if a=. and b = . and c= . and d =. then count=0;
else count=1;
do i = 1 to dim(hello)-1;
if hello(i) = . then count+ 0;
else if hello(i)-hello(i+1) = . then count+0;
else if hello(i)-hello(i+1) = 0  then count+ 0;
else if hello(i)-hello(i+1) ne 0  then count+ 1;
end;
drop i _:;
run;
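The reason sorting helps is that equal values end up next to each other, so a value is new exactly when it differs from its left neighbour. The loop above can probably be condensed along those lines. The following is a sketch only, using the same have dataset; the names want_sorted and s are hypothetical.

data want_sorted;
  set have;
  /* work on copies so a, b, c, d keep their original order */
  _a = a; _b = b; _c = c; _d = d;
  array s(*) _a _b _c _d;
  call sortn(of s(*));   /* ascending; missing values sort first */
  count = 0;
  do i = 1 to dim(s);
    if not missing(s(i)) then do;
      /* count the first element, or any value that differs from the one before it */
      if i = 1 then count = count + 1;
      else if s(i) ne s(i-1) then count = count + 1;
    end;
  end;
  drop i _:;
run;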

Answer 4

You can put the unique values into a temporary array. Let's convert your picture into data.

data have;
  input id a b c d; 
datalines;
11 2 3 4 4
22 1 8 1 1
33 6 . 1 2
44 . 1 1 .
;

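So make an array of the input variables and another temporary array to hold the unique values. Then loop over the input variables and save each value that has not been seen yet. Finally, count how many unique values there are.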

data want ;
  set have ;
  array unique (4) _temporary_;
  array values a b c d ;
  call missing(of unique(*));
  do _n_=1 to dim(values);
    if not missing(values(_n_)) then
      if not whichn(values(_n_),of unique(*)) then 
        unique(_n_)=values(_n_)
    ;
  end;
  count=n(of unique(*));
run;

Output:

Obs    id    a    b    c    d    count

 1     11    2    3    4    4      3
 2     22    1    8    1    1      2
 3     33    6    .    1    2      3
 4     44    .    1    1    .      1

SAS: comparing the values of multiple columns within the same observation?
