I have a dataset that looks like this (note that the products within each purchase are separated by spaces):
Client_ID Purchase
121212 "Orange_Juice Lettuce"
121212 "Banana Bread "
230102 "Banana Apple"
230102 "Chicken"
121212 "Chicken Bread"
301450 "Grapes Lettuce"
... ...
Now I would like to know which products each client has bought, with one dummy variable per item:
Client_ID Apple Banana Bread Chicken Grapes Lettuce Orange_Juice
121212 0 1 1 1 0 1 1
230102 1 1 0 1 0 0 0
301450 0 0 0 0 1 1 0
... ... ... ... ... ... ... ...
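A few weeks ago I asked a similar question, but there I did not have several items on the same row as I do here, so I am really lost. I tried splitting the items into multiple columns, but that is not ideal because each purchase can contain a different number of items (up to several dozen, as far as I know).
Any ideas on how to proceed? Thanks in advance!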
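Answer 0 (score: 2)
Here is a flexible solution using PROC FREQ and PROC TRANSPOSE. The SPARSE option gives you the zeros. I am assuming you only want 1 or 0, hence the NODUPKEY sort; remove NODUPKEY (or drop the sort entirely) if you want BREAD to be 2 for the first ID.
First create a vertical dataset with one record per ID/product (splitting each purchase into its products). Then run PROC FREQ on that dataset, so that you get a dataset with a 1/0 count for every client/product combination. Finally, transpose it using product as the ID variable and count as the VAR variable.
If there are products that you want guaranteed to appear, with zeros, even if nobody bought them, you should also add a row to the initial table (or anywhere before the PROC FREQ) with a dummy client ID and all possible products, and then delete the dummy client ID after the transpose.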
data test;
  /* Client_ID occupies columns 1-6; Purchase starts in column 8 of the data lines below */
  input @1 Client_ID 6. @8 Purchase $50.;
  datalines;
121212 Orange_Juice Lettuce
121212 Banana Bread
230102 Banana Apple
230102 Chicken
121212 Chicken Bread
301450 Grapes Lettuce
;;;;
run;
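/* One row per client/product: split each purchase into its words */
data vert;
  set test;
  format product $20.;
  do _x = 1 by 1 until (missing(product));
    product = scan(purchase, _x);
    if not missing(product) then output;
  end;
run;
/* NODUPKEY keeps a single row per client/product, so counts are 0 or 1 */
proc sort data=vert nodupkey;
  by client_id product;
run;
/* SPARSE also writes the zero-count combinations to the output dataset */
proc freq data=vert;
  tables client_id*product / sparse out=prods;
run;
/* Pivot to one column per product, holding COUNT */
proc transpose data=prods out=horiz;
  by client_id;
  id product;
  var count;
run;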
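The dummy-client idea from the last paragraph of this answer could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of the answer's own code: the sentinel ID 999999, the never-purchased product Milk, and the dataset names all_products, test_all, and horiz_final are all made up for the example, and the sample data is shortened.

```sas
/* Shortened sample purchases, same layout as the test dataset above */
data test;
  input @1 Client_ID 6. @8 Purchase $50.;
  datalines;
121212 Orange_Juice Lettuce
230102 Banana Apple
301450 Grapes Lettuce
;;;;
run;

/* Assumption: 999999 is never a real client ID, and this string lists
   every product that should get a column (nobody buys Milk here) */
data all_products;
  length Purchase $50;
  Client_ID = 999999;
  Purchase = 'Apple Banana Grapes Lettuce Orange_Juice Milk';
run;

/* Append the dummy client before splitting purchases into products */
data test_all;
  set test all_products;
run;

data vert;
  set test_all;
  length product $20;
  do _x = 1 by 1 until (missing(product));
    product = scan(purchase, _x);
    if not missing(product) then output;
  end;
  drop _x;
run;

proc sort data=vert nodupkey;
  by client_id product;
run;

proc freq data=vert noprint;
  tables client_id*product / sparse out=prods;
run;

proc transpose data=prods out=horiz;
  by client_id;
  id product;
  var count;
run;

/* Drop the dummy client; Milk stays behind as an all-zero column */
data horiz_final;
  set horiz;
  if Client_ID ne 999999;
run;
```

Because the dummy client contributes one row per product to the PROC FREQ input, SPARSE fills in a zero for every real client on products that only the dummy "bought".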
Answer 1 (score: 0)
Here is a DATA step programming solution:
proc sort data=have;
  by client_id;
run;
data want(keep=client_id apple banana bread chicken grapes lettuce orange_juice);
  set have;
  by client_id;
  retain apple banana bread chicken grapes lettuce orange_juice;
  /* Reset every flag at the start of each client */
  if first.client_id then do;
    apple = 0;
    banana = 0;
    bread = 0;
    chicken = 0;
    grapes = 0;
    lettuce = 0;
    orange_juice = 0;
  end;
  length item $20;
  _x = 1;
  item = scan(purchase,_x);
  /* Walk through the words of PURCHASE; the comparison is case-sensitive */
  do while(item ne ' ');
    select(item);
      when('Apple') apple = 1;
      when('Banana') banana = 1;
      when('Bread') bread = 1;
      when('Chicken') chicken = 1;
      when('Grapes') grapes = 1;
      when('Lettuce') lettuce = 1;
      when('Orange_Juice') orange_juice = 1;
      otherwise;
    end;
    _x = _x + 1;
    item = scan(purchase,_x);
  end;
  /* Write one row per client once all of its purchases have been read */
  if last.client_id then output;
run;
Edit: I had missed the part of the question about multiple items in each PURCHASE variable. Thanks, Joe!
Answer 2 (score: 0)
Having the SAS DATA step do some of the dummy-variable coding for you is also a workable solution.
data test;
  input Client_ID 6. Purchase $50.;
  datalines;
121212 Orange_Juice Lettuce
121212 Banana Bread
230102 Banana Apple
230102 Chicken
121212 Chicken Bread
301450 Grapes Lettuce
;;;;
run;
filename tmp temp;
/* First pass: collect the distinct product names and write generated code to a temp file */
data _null_;
  set test end = done;
  file tmp;
  length product $25 prodlist $1000;
  retain prodlist;
  do i = 1 to countw( purchase, " " );
    product = scan( purchase, i, " " );
    prodlist = ifc( indexw( prodlist, product )=0, catx( ' ', prodlist, product ), prodlist );
  end;
  if done then do;
    /* Turn "A B C" into "A=0; B=0; C=0;" so every dummy starts at zero */
    prodlinit=prxchange("s/ /=0; /",-1,compbl(prodlist));
    put 'array prods(*) ' prodlist ';' / prodlinit;
  end;
run;
/* Second pass: include the generated statements, then flag the products found in each purchase */
data new;
  set test;
  %inc tmp/source2;
  do i = 1 to dim( prods );
    if indexw(purchase,vname(prods(i))) > 0 then prods(i) = 1;
  end;
run;
proc print;
run;
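Note that the `new` dataset built above has one row per purchase rather than one row per client. If one row per client is wanted, as in the question, one possible follow-up is to take the maximum of each dummy within a client. The following is a minimal sketch under assumptions: it uses a hand-typed stand-in for `new` with only three product columns, and the output name `want` is made up for the example.

```sas
/* Hypothetical stand-in for the per-purchase dummies (three products only) */
data new;
  input Client_ID Banana Bread Chicken;
  datalines;
121212 1 1 0
121212 0 1 1
230102 0 0 1
;;;;
run;

/* The MAX of a 0/1 flag within a client is 1 if any purchase contained the product */
proc means data=new noprint nway;
  class Client_ID;
  var Banana Bread Chicken;
  output out=want(drop=_type_ _freq_) max=;
run;

proc print data=want;
run;
```

With the real dataset, the VAR statement would list the generated product variables instead of the three names used here.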