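I have simplified my table description to keep the question concise...
I have a dataset with 3 columns. The first column contains 100 cost categories (CostCat, i.e. a unique key), the second contains the Cost for the given cost category, and the third contains UnitsSold.
My goal is to turn this into a table with one column per CostCat, each holding the sum of Cost for that category, grouped by UnitsSold.
i.e.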
╔════════════╦══════════╦══════════╦═══════
║ UnitsSold ║ CostCat1 ║ CostCat2 ║ CostCat...
╠════════════╬══════════╬══════════╬═══════
║ 1 ║ 50 ║ 10 ║ ...
║ 2 ║ 20 ║ 15 ║ ...
║ ... ║ ... ║ ... ║ ...
╚════════════╩══════════╩══════════╩═══════
I am inclined to use code like this:
PROC SQL;
CREATE TABLE cartesian AS
SELECT
  UnitsSold,
  SUM(CASE WHEN CostCat=1 THEN Cost ELSE 0 END) AS CostCat1,
  SUM(CASE WHEN CostCat=2 THEN Cost ELSE 0 END) AS CostCat2,
  SUM(CASE WHEN CostCat=3 THEN Cost ELSE 0 END) AS CostCat3,
  ...
  SUM(CASE WHEN CostCat=100 THEN Cost ELSE 0 END) AS CostCat100
FROM have /* source table name assumed; the FROM clause was missing */
GROUP BY UnitsSold;
QUIT;
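Is there a more efficient way to do this than writing out a large number of CASE expressions? (Obviously I would use Excel to generate the actual text.)
I imagine some kind of macro loop is possible, but I am not very familiar with macros.
I have traditionally used PROC SQL, so that is my preference, but I am also open to a SAS code solution.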
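Answer 0 (score: 2)
Michael:
The question describes a PIVOT operation, which in SAS terms is also called TRANSPOSE, and in Excel is Paste Special / Transpose or a pivot table.
Proc SQL has no PIVOT operator. SQL Server and some other databases do have one. So if you stay with SAS Proc SQL, you are correct: you need those CASE expressions to create the across variables.
There are many ways to pivot data in SAS. Here are six of them, demonstrated on the following sample data: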
data have;
do row = 1 to 500;
cost_cat = ceil(100 * ranuni(123));
cost = 10 + floor(50 * ranuni(123));
units_sold = floor (20 * ranuni(123));
output;
end;
run;
Way 1: Proc TABULATE. Use class variables in the table statement to lay out the rows and columns.
proc tabulate data=have;
class cost_cat units_sold;
var cost;
table units_sold, cost_cat*cost*sum / nocellmerge;
run;
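Way 2: Proc REPORT. The cost category and cost columns are stacked. cost has no define statement, so as a numeric variable it defaults to an analysis variable with the sum statistic. The sum of cost is therefore computed within each cell formed by a group value (units_sold) and an across value (cost_cat):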
proc report data=have;
columns units_sold (cost_cat, cost) ;
define units_sold / group;
define cost_cat / across;
run;
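Way 3: Proc MEANS followed by Proc TRANSPOSE. The transpose creates a dataset whose columns are 'out of order', because columns are created in the order in which the id values appear while stepping through the units_sold by-groups.
This can be prevented by adding extra data to have. The extra data would have units_sold = -1 and one row for each cost_cat value. The extra group is then removed through the TRANSPOSE out= dataset options, for example: (... where=(units_sold ne -1))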
proc means noprint data=have;
class units_sold cost_cat;
var cost;
ways 2;
output sum=sum out=haveMeans ;
run;
proc transpose data=haveMeans out=wantAcross1(drop=_name_) prefix=cost_sum_;
by units_sold;
var sum;
id cost_cat;
run;
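The column-ordering fix described above can be sketched as follows. This is only an illustration under stated assumptions: it relies on the have dataset from the sample data, and the names seed, haveSeeded, haveMeans2 and wantAcross1_ordered are hypothetical, chosen so that nothing from the other examples is overwritten.

```sas
/* Hypothetical seed rows: one row per cost_cat under a dummy group units_sold = -1 */
proc sql;
  create table seed as
  select distinct -1 as units_sold, cost_cat, 0 as cost
  from have;
quit;

/* Stack the seed rows with the real data */
data haveSeeded;
  set seed have(keep=units_sold cost_cat cost);
run;

/* Class output is sorted, so the -1 group comes first with cost_cat ascending */
proc means noprint data=haveSeeded;
  class units_sold cost_cat;
  var cost;
  ways 2;
  output sum=sum out=haveMeans2;
run;

/* The first by-group now introduces every id value in order;
   the dummy group is filtered out of the result by where= */
proc transpose data=haveMeans2
               out=wantAcross1_ordered(drop=_name_ where=(units_sold ne -1))
               prefix=cost_sum_;
  by units_sold;
  var sum;
  id cost_cat;
run;
```

The dummy group only exists to fix the order in which the cost_sum_ columns are created, so its zero cost values never reach the output.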
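Way 4: a dataset-specific macro. This macro is simpler because it is specific to the dataset in question. For a more general case, the salient aspects of the statement generation can be abstracted and turned into macro parameters (see way 5).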
%macro pivot_across;
%local i;
proc sql;
create table wantAcross2 as
select units_sold
%do i = 1 %to 100; %* codegen many sql select expressions;
, sum ( case when cost_cat = &i then cost else 0 end ) as cost_sum_&i
%end;
from have
group by units_sold;
quit;
%mend;
%pivot_across;
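Tip: with some changes, the code could use Proc SQL pass-through and perform the pivot remotely.
Way 5: a generalized macro that is not tied to any particular dataset. In its current form this macro handles numeric id variables whose values can be represented exactly by the numeric literal that cats() emits. A more robust version would check the type of the id variable and quote the id values compared in the generated CASE expressions. The most robust version would generate CASE expressions that compare each value through put(..., RB8.), that is, on its binary representation.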
%macro sql_transpose (data=, out=, by=, var=, id=, aggregate_function=sum, default=0, prefix=, suffix=);
/*
* The CASE code generator needs tweaking to handle character id variables (i.e. QUOTE of the &id values)
* The CASE code generator needs tweaking to handle numeric id variables with non-integer values
* that cannot be expressed as a simple source code numeric literal (i.e. may need to compare data formatted as RB8.)
*/
%local case_statements;
proc sql noprint;
select
"&aggregate_function ("
|| "CASE when &id = " || cats(idValues.&id) || " then &var else &default end"
|| ") as &prefix" || cats(idValues.&id) || "&suffix"
into :case_statements
separated by ','
from (select distinct &id from &data) as idValues
order by &id
;
%*put NOTE: %superq(case_statements);
create table &out as
select &by, &case_statements
from &data
group by &by;
quit;
%mend;
%sql_transpose
( data=have
, out=wantAcross3
, by=units_sold
, id=cost_cat
, var=cost
, prefix=cost_sum_
);
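Tip: with some changes, the code could use Proc SQL pass-through and perform the pivot remotely. Special care would be needed with the data behind the collection of case_statements.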
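The character-id variant mentioned for the more robust version might look roughly like this. It is a sketch, not part of the original macro: have_char, cat_code, char_case_statements and wantAcrossChar are hypothetical names, and it assumes every id value is also a valid fragment of a SAS variable name.

```sas
/* Hypothetical sample: derive a character id (C1, C2, ...) from the numeric cost_cat */
data have_char;
  set have(keep=units_sold cost_cat cost);
  length cat_code $8;
  cat_code = cats('C', cost_cat);
run;

proc sql noprint;
  /* quote() wraps each id value in double quotes so the generated
     comparison is against a character literal */
  select
    "sum(case when cat_code = " || quote(trim(cat_code))
    || " then cost else 0 end) as cost_sum_" || trim(cat_code)
  into :char_case_statements separated by ','
  from (select distinct cat_code from have_char) as ids
  order by cat_code;

  /* the generated expressions are spliced into the pivot query */
  create table wantAcrossChar as
  select units_sold, &char_case_statements
  from have_char
  group by units_sold;
quit;
```

Because the ids are character, the columns come out in lexical order (cost_sum_C1, cost_sum_C10, cost_sum_C100, ...), not numeric order.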
Way 6: hash objects. If you are a hash enthusiast, this code will not seem extravagant.
data _null_;
if 0 then set have(keep=units_sold cost_cat cost); * prep pdv;
* hash for tracking id values;
declare hash ids(ordered:'a');
ids.defineKey('cost_cat');
ids.defineDone();
* hash for tracking sums
* NOTE: The data node has a sum variable instead of using
* argument tags suminc: and keysum: This was done because HITER NEXT() does not
* automatically load the keysum value into its PDV variable (meaning
* another lookup via .SUM() would have to occur in order to obtain it);
call missing (cost_sum);
declare hash sums(ordered:'a');
sums.defineKey('units_sold', 'cost_cat');
sums.defineData('units_sold', 'cost_cat', 'cost_sum');
sums.defineDone();
* scan the data - track the id values and sums for pivoted output;
do while (not done);
set have(keep=units_sold cost_cat cost) end=done;
ids.ref();
if (0 ne sums.find()) then cost_sum = 0;
cost_sum + cost;
sums.replace();
end;
* create a dynamic output target;
* a pool of pdv host variables is required for target;
array cells cost_sum_1 - cost_sum_10000;
call missing (of cost_sum_1 - cost_sum_10000);
* need to iterate across the id values in order to create a
* variable name that will be part of the wanted data node;
declare hiter across('ids');
declare hash want(ordered:'a');
want.defineKey('units_sold');
want.defineData('units_sold');
do while (across.next() = 0);
want.defineData(cats('cost_sum_',cost_cat)); * sneaky! ;
end;
want.defineDone();
* populate target;
* iterate through the sums filling in the PDV variables
* associated with the dynamically defined data node;
declare hiter item('sums');
prior_key1 = .; prior_key2 = .;
do while (item.next() = 0);
if units_sold ne prior_key1 then do;
* when the 'group' changes add an item to the want hash, which will reflect the state of the PDV;
if prior_key1 ne . then do;
key1_hold = units_sold;
units_sold = prior_key1;
want.add(); * save 'row' to hash;
units_sold = key1_hold;
call missing (of cells(*));
end;
end;
cells[cost_cat] = cost_sum;
prior_key1 = units_sold;
end;
want.add();
* output target;
want.output (dataset:'wantAcross4');
stop;
run;
Proc COMPARE will show that all of the want outputs are the same.
proc compare nomissing
noprint data=wantAcross1 compare=wantAcross2 out=diff1_2 outnoequal;
id units_sold;
run;
proc compare
noprint data=wantAcross2 compare=wantAcross3 out=diff2_3 outnoequal;
id units_sold;
run;
proc compare nomissing
noprint data=wantAcross3 compare=wantAcross4 out=diff3_4 outnoequal;
id units_sold;
run;
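Because each comparison uses outnoequal, a diff dataset only receives observations in which at least one compared value differs, so an empty diff dataset indicates agreement. A minimal sketch for checking the row counts, assuming the diff datasets were written to the WORK library:

```sas
/* List the number of observations in each diff dataset;
   zero observations means no unequal values were found */
proc sql;
  select memname, nobs
  from dictionary.tables
  where libname = 'WORK'
    and memname in ('DIFF1_2', 'DIFF2_3', 'DIFF3_4');
quit;
```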
Answer 1 (score: 0)
As Reeza pointed out, the best approach is probably a combination of proc sql or proc means/summary with proc transpose. I assume you know SQL, so I will describe that first.
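proc sql;
create table tmp as
select UnitsSold, CostCat, sum(cost) as cost
from have
group by UnitsSold, CostCat
order by UnitsSold, CostCat; /* explicit sort: BY processing in proc transpose needs sorted input */
quit;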
If you would rather do this with a SAS procedure, you can use proc summary.
proc summary data=have nway missing;
class UnitsSold CostCat;
var Cost;
output out=tmp(drop=_:) sum=; ** drop=_: removes the automatic variables created in the procedure;
run;
Now that the table has been summarized and sorted by UnitsSold and CostCat, you can transpose it.
proc transpose data=tmp out=want(drop=_NAME_) prefix=CostCat;
by UnitsSold;
id CostCat;
var cost;
run;