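SAS - PROC SQL - Sum values into unique columns

Time: 2017-12-01 20:13:58

Tags: sas proc-sql

Simplifying my table description to keep my question concise...

I have a dataset with 3 columns. The first column contains 100 Cost Categories (i.e. a unique key), the second contains the Cost for the given Cost Category, and the third contains the Units Sold.

My goal is to turn this into a table with one column per CostCat, each holding the sum of the Cost field for that category, grouped by UnitsSold.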


╔════════════╦══════════╦══════════╦═══════
║  UnitsSold ║ CostCat1 ║ CostCat2 ║ CostCat...
╠════════════╬══════════╬══════════╬═══════
║    1       ║    50    ║    10    ║ ...
║    2       ║    20    ║    15    ║ ...
║    ...     ║    ...   ║    ...   ║ ...
╚════════════╩══════════╩══════════╩═══════

I'm inclined to use code like this:

PROC SQL;
CREATE TABLE cartesian AS
SELECT
  UnitsSold,
  SUM(CASE WHEN CostCat=1 THEN Cost else 0 end) as CostCat1,
  sum(case when CostCat=2 then Cost else 0 end) as CostCat2,
  sum(case when CostCat=3 then Cost else 0 end) as CostCat3,
  ...
  sum(case when CostCat=100 then Cost else 0 end) as CostCat100
FROM have  /* input table name is assumed; the question does not give it */
GROUP BY UnitsSold;
QUIT;

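Is there a more efficient way to do this than writing out a huge number of CASE statements? (Obviously I would use Excel to generate the actual typing.)

I imagine some kind of macro loop might be possible, but I know very little about macros.

I traditionally use PROC SQL, so that is my preference, but I am open to SAS code solutions as well.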

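2 Answers:

Answer 0: (score: 2)

Michael:

The question describes a PIVOT operation, also known in SAS parlance as TRANSPOSE, and in Excel as Paste Special / Transpose or a Pivot Table.

Proc SQL has no PIVOT operator. SQL Server and some other databases do have one. So if you insist on staying with SAS Proc SQL, you are correct that you need those CASE statements to create the across variables.

There are many ways to pivot data in SAS. Here are six:

Sample data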

data have;
  do row = 1 to 500;
    cost_cat = ceil(100 * ranuni(123));
    cost = 10 + floor(50 * ranuni(123));
    units_sold = floor (20 * ranuni(123));
    output;
  end;
run;

Way 1 - Proc TABULATE: for presentation only

Use class variables in the table statement to lay out the rows and columns.

proc tabulate data=have;
  class cost_cat units_sold;
  var cost;
  table units_sold, cost_cat*cost*sum / nocellmerge;
run;

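Way 2 - Proc REPORT: for presentation only

The cost_cat and cost columns are stacked. cost has no define statement, so as a numeric variable it defaults to analysis usage with the sum statistic. The sum is computed over the cost values in each group * across cell: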

proc report data=have;
  columns units_sold (cost_cat, cost) ;
  define units_sold / group;
  define cost_cat / across;
run;

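Way 3 - Proc MEANS + Proc TRANSPOSE: data pivot

The transpose will create a data set whose columns are 'out of order', because columns are created in the order in which the id values first appear while stepping through the units_sold groups.
This can be prevented by adding extra data to have. The extra data would have units_sold = -1, with one row for each cost_cat value. The extra group would then be dropped as part of the TRANSPOSE out= data set options, for example: (... where=(units_sold ne -1))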

proc means noprint data=have;
  class units_sold cost_cat;
  var cost;
  ways 2;
  output sum=sum out=haveMeans ;
run;

proc transpose data=haveMeans out=wantAcross1(drop=_name_) prefix=cost_sum_;
  by units_sold;
  var sum;
  id cost_cat;
run;
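
The extra-data idea described before the Proc MEANS step can be sketched roughly as follows. This is only an illustration, not part of the original answer: the data set names seed, haveMeansSeeded and wantAcross1o are made up here, and it assumes units_sold is never negative, so the -1 group sorts first.

```sas
* one seed row per cost_cat value, in ascending order;
* (seed, haveMeansSeeded and wantAcross1o are hypothetical names);
proc sort data=haveMeans(keep=cost_cat) out=seed nodupkey;
  by cost_cat;
run;

* put the seed rows first, under the sentinel group units_sold = -1;
* sum stays missing on seed rows because seed does not carry that variable;
data haveMeansSeeded;
  set seed(in=isSeed) haveMeans(keep=units_sold cost_cat sum);
  if isSeed then units_sold = -1;
run;

* the sentinel group presents every id value in order, so the columns
* come out in cost_cat order, and the where= option then discards it;
proc transpose data=haveMeansSeeded
               out=wantAcross1o(drop=_name_ where=(units_sold ne -1))
               prefix=cost_sum_;
  by units_sold;
  var sum;
  id cost_cat;
run;
```

Because the seed is built from haveMeans itself, only the cost_cat values that actually occur in the data get a column.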

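Way 4 - SQL `wallies` code generated by a macro: specific to one data set

This macro is simpler because it is specific to the data set in question. For a more general case, the salient aspects of the statement generation can be abstracted and further macro-ized (see Way 5).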

%macro pivot_across;
  %local i;

  proc sql;
    create table wantAcross2 as
    select units_sold
    %do i = 1 %to 100;  %* codegen many sql select expressions;
    , sum ( case when cost_cat = &i then cost else 0 end ) as cost_sum_&i
    %end;
    from have
    group by units_sold;
  quit;
%mend;

%pivot_across;

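Tip: With some changes, the code could be Proc SQL pass-through and perform the pivot remotely.

Way 5 - SQL `wallies` code generated by a macro: any data set

Not quite any data set. In its current form this macro handles id variables that are numeric and whose values can be exactly represented by the numeric literals that cats() emits. A more robust version would examine the type of the id variable and quote the id values compared in the generated CASE statements. The most robust version would generate CASE statements that check the put(..., RB8.) of each id value.
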
%macro sql_transpose (data=, out=, by=, var=, id=, aggregate_function=sum, default=0, prefix=, suffix=);

/*
 * The CASE statement code generator will need tweaking to handle character id variables (i.e. QUOTE of the &id)
 * The CASE statement code generator will need tweaking to handle numeric id variables that have non-integer values
 * inexpressible as a simple source code numeric literal (i.e. may need to compare data when formatted as RB8.)
 */

  %local case_statements;

  proc sql noprint;
    select
    "&aggregate_function ("
    || "CASE when &id = " || cats(idValues.&id) || " then &var else &default end"   
    || ") as &prefix" || cats(idValues.&id) || "&suffix"
    into :case_statements
    separated by ','
    from (select distinct &id from &data) as idValues
    order by &id
    ;

  %*put NOTE: %superq(case_statements);

    create table &out as
    select &by, &case_statements
    from &data
    group by &by;
  quit;

%mend;

%sql_transpose 
( data=have
, out=wantAcross3
, by=units_sold
, id=cost_cat
, var=cost
, prefix=cost_sum_
);
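
For the character id case mentioned in the macro's comment, the code generation might look something like the sketch below. It is an assumption-laden illustration rather than a tested extension of the macro: have_c, cost_cat_c and wantAcross3c are hypothetical names, the character id is fabricated from the sample data with the z3. format, and the id values are assumed to be usable as name suffixes (no blanks or special characters).

```sas
* hypothetical character version of the id variable;
data have_c;
  set have;
  length cost_cat_c $3;
  cost_cat_c = put(cost_cat, z3.);   * 1 -> '001', 100 -> '100';
run;

proc sql noprint;
  * quote() wraps each id value in double quotes for the comparison;
  select
    "sum(case when cost_cat_c = " || quote(trim(cost_cat_c))
    || " then cost else 0 end) as cost_sum_" || trim(cost_cat_c)
  into :case_statements separated by ','
  from (select distinct cost_cat_c from have_c) as idValues
  order by cost_cat_c;

  create table wantAcross3c as
  select units_sold, &case_statements
  from have_c
  group by units_sold;
quit;
```

A generalized macro would choose between this quoted form and the numeric form after looking up the id variable's type, for example in dictionary.columns.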

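Tip: With some changes, the code could be Proc SQL pass-through and perform the pivot remotely. Special care would be needed when gathering the data behind case_statements.

Way 6 - Hash: numerically indexed pivot columns

If you are a hash enthusiast, this code might not seem extravagant.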

data _null_;
  if 0 then set have(keep=units_sold cost_cat cost); * prep pdv;

  * hash for tracking id values;

  declare hash ids(ordered:'a');
  ids.defineKey('cost_cat');
  ids.defineDone();

  * hash for tracking sums
  * NOTE: The data node has a sum variable instead of using 
  * argument tags suminc: and keysum:  This was done because HITER NEXT() does not 
  * automatically load the keysum value into its PDV variable (meaning
  * another lookup via .SUM() would have to occur in order to obtain it);

  call missing (cost_sum);

  declare hash sums(ordered:'a');
  sums.defineKey('units_sold', 'cost_cat');
  sums.defineData('units_sold', 'cost_cat', 'cost_sum');
  sums.defineDone();

  * scan the data - track the id values and sums for pivoted output;

  do while (not done);
    set have(keep=units_sold cost_cat cost) end=done;

    ids.ref();

    if (0 ne sums.find()) then cost_sum = 0;
    cost_sum + cost;
    sums.replace();
  end;

  * create a dynamic output target;
  * a pool of pdv host variables is required for target;

  array cells cost_sum_1 - cost_sum_10000;
  call missing (of cost_sum_1 - cost_sum_10000);

  * need to iterate across the id values in order to create a 
  * variable name that will be part of the wanted data node;

  declare hiter across('ids');

  declare hash want(ordered:'a');
  want.defineKey('units_sold');
  want.defineData('units_sold');
  do while (across.next() = 0);
    want.defineData(cats('cost_sum_',cost_cat));  * sneaky! ;
  end;
  want.defineDone();

  * populate target;
  * iterate through the sums filling in the PDV variables
  * associated with the dynamically defined data node;

  declare hiter item('sums');
  prior_key1 = .; prior_key2 = .;
  do while (item.next() = 0);
    if units_sold ne prior_key1 then do;
      * when the 'group' changes add an item to the want hash, which will reflect the state of the PDV;
      if prior_key1 ne . then do;
        key1_hold = units_sold;
        units_sold = prior_key1;
        want.add();                  * save 'row' to hash;
        units_sold = key1_hold;
        call missing (of cells(*));
      end;
    end;

    cells[cost_cat] = cost_sum;
    prior_key1 = units_sold;
  end;
  want.add();

  * output target;

  want.output (dataset:'wantAcross4');

  stop;
run;

Verification

Proc COMPARE will show that all the want outputs are the same.

proc compare nomissing 
  noprint data=wantAcross1 compare=wantAcross2 out=diff1_2 outnoequal;
  id units_sold;
run;

proc compare 
  noprint data=wantAcross2 compare=wantAcross3 out=diff2_3 outnoequal;
  id units_sold;
run;

proc compare nomissing 
  noprint data=wantAcross3 compare=wantAcross4 out=diff3_4 outnoequal;
  id units_sold;
run;
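
Since the compares run with noprint, one way to read the result is to check that the out= data sets are empty, because outnoequal only writes observations where a difference was found. A small sketch, assuming the diff data sets were written to the WORK library:

```sas
* list the row count of each difference data set;
* a count of 0 means the compared pair had no unequal values;
proc sql;
  select memname, nobs
  from dictionary.tables
  where libname = 'WORK'
    and memname in ('DIFF1_2', 'DIFF2_3', 'DIFF3_4');
quit;
```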

Answer 1: (score: 0)

As Reeza pointed out, the best approach is probably a combination of proc sql or proc means/summary with proc transpose. I assume you know SQL, so I'll start with that.

proc sql;
create table tmp as
select UnitsSold, CostCat, sum(cost) as cost
from have
group by UnitsSold, CostCat;
quit;

If you would rather do this with a SAS procedure, you can use proc summary:

proc summary data=have nway missing;
class UnitsSold CostCat;
var Cost;
output out=tmp(drop=_:) sum=;  ** drop=_: removes the automatic variables created in the procedure;
run;

Now that the table is summarized and sorted by UnitsSold and CostCat, you can transpose it.

proc transpose data=tmp out=want(drop=_NAME_) prefix=CostCat;
by UnitsSold;
id CostCat;
var cost;
run;