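Creating dummy variables from multiple strings in the same row

Asked: 2012-09-05 12:38:45

Tags: sas

I have a dataset that looks like this (note that the products in each purchase are separated by a space):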

Client_ID      Purchase
121212         "Orange_Juice Lettuce"
121212         "Banana Bread "
230102         "Banana Apple"
230102         "Chicken"
121212         "Chicken Bread"
301450         "Grapes Lettuce"
...            ...

Now I would like to know which products each person bought, with one dummy variable per item:

Client_ID    Apple    Banana    Bread    Chicken    Grapes    Lettuce    Orange_Juice
121212       0        1         1        1          0         1          1  
230102       1        1         0        1          0         0          0
301450       0        0         0        0          1         1          0
...          ...      ...       ...      ...        ...       ...        ...

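A few weeks ago I asked a similar question, but in that case I did not have several items in the same row as I do here, so I am really lost. I tried splitting the items into multiple columns, but that is not ideal because each purchase can contain a different number of items (up to dozens, as far as I know).

Any ideas on how to proceed? Thanks in advance!

3 Answers:

Answer 0: (score: 2)

Here is a flexible solution using PROC FREQ and PROC TRANSPOSE. The SPARSE option gives you the zeros. I am assuming you only want 1 or 0, hence the NODUPKEY sort; remove NODUPKEY (or drop the sort entirely) if you want BREAD to be 2 for the first ID.

First create a vertical dataset with one record per ID/product (splitting each purchase into its products); then run PROC FREQ on that dataset, so that you get a dataset with a 1/0 count for every client/product combination; then transpose, using product as the ID variable and count as the VAR variable.

If there are products that you want guaranteed to appear as zeros even when nobody bought them, add rows to the initial table (or anywhere before the PROC FREQ) for a dummy client ID that has all possible products, and then delete the dummy client ID after the transpose.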

data test;
input @1 Client_ID  6.   @16 Purchase $50.;
datalines;
121212         Orange_Juice Lettuce
121212         Banana Bread 
230102         Banana Apple
230102         Chicken
121212         Chicken Bread
301450         Grapes Lettuce
;;;;
run;

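/* One row per client/product: split each purchase on blanks */
data vert;
set test;
length product $20;
do _x = 1 by 1 until (missing(product));
  product=scan(purchase,_x,' ');
  if not missing(product) then output;
end;
drop _x;
run;

/* Keep one row per client/product so the counts are 0 or 1 */
proc sort data=vert nodupkey;
by client_id product;
run;

/* SPARSE also writes the client/product combinations that never occur, with a count of 0 */
proc freq data=vert noprint;
tables client_id*product/sparse out=prods;
run;

/* One row per client, one column per product */
proc transpose data=prods out=horiz(drop=_name_ _label_);
by client_id;
id product;
var count;
run;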
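
The dummy-client trick described in this answer could look roughly like the sketch below. It assumes the `vert` dataset built above; the client ID 999999, the dataset names `dummy`, `vert_all`, `prods_all` and `horiz_all`, and the extra product `Pear` are made up for illustration and are not part of the original answer.

```sas
/* Hypothetical: 999999 is assumed not to be a real client ID,
   and Pear stands in for a product that nobody bought. */
data dummy;
   length product $20;
   client_id = 999999;
   do product = 'Apple', 'Banana', 'Bread', 'Chicken', 'Grapes',
                'Lettuce', 'Orange_Juice', 'Pear';
      output;
   end;
run;

/* Append the dummy client to the real client/product rows */
data vert_all;
   set vert dummy;
run;

proc freq data=vert_all noprint;
   tables client_id*product / sparse out=prods_all;
run;

/* Transpose as before, then drop the dummy client from the result */
proc transpose data=prods_all
               out=horiz_all(drop=_name_ _label_ where=(client_id ne 999999));
   by client_id;
   id product;
   var count;
run;
```

With this in place every real client should get a `Pear` column equal to 0, because SPARSE fills in the combinations that only the dummy client supplies.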

Answer 1: (score: 0)

Here is a data step programming solution:

proc sort data=have;
   by client_id;
run;
data want(keep=client_id apple banana bread chicken grapes lettuce orange_juice);
   set have;
      by client_id;
   retain apple banana bread chicken grapes lettuce orange_juice;
   if first.client_id then do;
      apple = 0;
      banana = 0;
      bread = 0 ;
      chicken = 0;
      grapes = 0;
      lettuce = 0;
      orange_juice = 0;
      end;
   length item $20;
   _x = 1;
   item = scan(purchase,_x);
   do while(item ne ' ');
      select(item);
         when('Apple') apple = 1;
         when('Banana') banana = 1;
         when('Bread') bread = 1;
         when('Chicken') chicken = 1;
         when('Grapes') grapes = 1;
         when('Lettuce') lettuce = 1;
         when('Orange_Juice') orange_juice = 1;
         otherwise;
         end;
      _x = _x + 1;
      item = scan(purchase,_x);
      end;
   if last.client_id then output;
run;

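Edit: I had missed the part of the question about multiple items in each PURCHASE variable. Thanks, Joe!

Answer 2: (score: 0)

Letting a SAS data step write some of the dummy variable code for you is also a viable solution.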

data test;
input Client_ID 6. Purchase $50.;
datalines;
121212         Orange_Juice Lettuce
121212         Banana Bread 
230102         Banana Apple
230102         Chicken
121212         Chicken Bread
301450         Grapes Lettuce
 ;;;;
 run;

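filename tmp temp;

/* Pass 1: collect the distinct product names and write SAS code to a temp file */
data _null_;
   set test end = done;
   file tmp;
   length product $25 prodlist $1000 prodlinit $2000;
   retain prodlist;
   do i = 1 to countw( purchase, " " );
      product = scan( purchase, i, " " );
      /* append the product only if it is not in the list yet */
      if indexw( prodlist, product ) = 0 then prodlist = catx( ' ', prodlist, product );
   end;
   if done then do;
      /* build "A=0; B=0; ..." so that every dummy starts at zero */
      prodlinit = prxchange( "s/ /=0; /", -1, compbl( prodlist ) );
      put 'array prods(*) ' prodlist ';' / prodlinit;
   end;
run;

/* Pass 2: include the generated ARRAY statement and initialisations,
   then flag each product found in this row's purchase */
data new;
   set test;
   %inc tmp/source2;
   do i = 1 to dim( prods );
      if indexw( purchase, vname( prods(i) ) ) > 0 then prods(i) = 1;
   end;
   drop i;
run;

proc print;
run;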
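
Note that `new` still has one row per purchase line, not one row per client as in the table requested in the question. A possible follow-up step, not part of the original answer, is to take the maximum of each dummy within a client. The sketch below hard-codes the product names from the sample data and writes to a dataset called `want`; both are assumptions for illustration.

```sas
/* Collapse to one row per client: a dummy is 1 if any of the
   client's purchase rows had it set to 1. */
proc summary data=new nway;
   class client_id;
   var Apple Banana Bread Chicken Grapes Lettuce Orange_Juice;
   output out=want(drop=_type_ _freq_) max=;
run;

proc print data=want;
run;
```

MAX works here because the dummies are only ever 0 or 1; using SUM instead would count the number of purchase rows that contain each product.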