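I have two datasets, one for males and one for females, containing the same variables. I need to find the percent difference between the sexes for each variable, by group.
The datasets look like this, but with more variables and groups: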
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
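To reproduce the problem, the two sample tables can be built with data steps along these lines. This is a minimal sketch: the dataset names females and males are assumed to match the merge step further down, and the values are copied from the tables above.

```sas
/* Sample data copied from the tables above.              */
/* Dataset names are assumed to match the later merge.    */
data females;
  input Group Sex $ VarA VarB;
  datalines;
1 F 8 5
2 F 6 3
3 F 7 0
;
run;

data males;
  input Group Sex $ VarA VarB;
  datalines;
1 M 9 7
2 M 8 5
3 M 6 3
;
run;
```

Both datasets are already in Group order, so they can be merged with a by statement without sorting first.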
The result I need is:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I can solve this by renaming every variable:
data difference;
  merge
    females (rename = (VarA = VarA_F VarB = VarB_F))
    males (rename = (VarA = VarA_M VarB = VarB_M))
  ;
  by group;
  percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
  percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
  drop sex;
run;
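Written out, the calculation in the data step is the female-minus-male difference divided by the mean of the two values. The symbols F and M below are only shorthand for the female and male values of one variable within one group; the worked numbers are Group 1, VarA from the sample tables.

```latex
\[
\text{percent\_diff} = \frac{F - M}{(F + M)/2},
\qquad
\text{e.g. } \frac{8 - 9}{(8 + 9)/2} = \frac{-1}{8.5} \approx -0.1176
\]
```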
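However, this approach requires me to rename everything by hand, and with many variables the rename statement becomes unwieldy. Unfortunately, this calculation is being inserted into some legacy code, so renaming the variables in the original datasets is not practical.
Is there another, less cumbersome way to do this?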
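Edit: I have updated the variable names because they seemed to be causing confusion. They were originally called Var1 and Var2; they are now VarA and VarB. The real variable names are descriptive, e.g. body_weight_g or gonadal_somatic_index. The variables are not simply numbered in sequence.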
Answer 0 (score: 1)
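For datasets whose variables are numbered sequentially, there is a variable-list syntax that renames a whole range of variables at once.
This example creates sample data with 100 variables: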
data have1 have2;
  do group = 1 to 100;
    sex = 'M';
    array var(100);
    do _n_ = 1 to dim(var);
      var(_n_) = ceil (25 * ranuni(123));  /* random integer from 1 to 25 */
    end;
    if group ne 42 then output have1;      /* group 42 is left out of have1 */
    sex = 'F';
    do _n_ = 1 to dim(var);
      var(_n_) = ceil (25 * ranuni(123));
    end;
    if group ne 100-42 then output have2;  /* group 58 is left out of have2 */
  end;
run;
A single rename option then covers all 100 variables:
data want;
  merge
    have1(rename=(var1-var100=mvar1-mvar100) in=_M)
    have2(rename=(var1-var100=fvar1-fvar100) in=_F)
  ;
  by group;
  /* compute only for groups present exactly once in both datasets */
  if _M & _F & first.group & last.group then do;
    array one mvar1-mvar100;
    array two fvar1-fvar100;
    array results result1-result100;
    do i = 1 to dim(results);
      diff = one(i) - two(i);          /* male minus female */
      mean = mean (one(i), two(i));
      results(i) = diff / mean * 100;  /* expressed as a percentage */
    end;
  end;
  keep group result:;
run;
Answer 1 (score: 1)
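Shenglin's answer is a neat use of SQL. Another approach is to build a macro variable that holds the rename pairs for use in a rename DSO (data set option). This can be done with an SQL query against the dictionary table that contains the column names.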
/* This macro creates the macro variable rename_<inset>, for use in a rename statement or data set option. */
/* It will have the form: var1 = var1_suffix var2 = var2_suffix ... */
/* &inset is the input data set. &suffix is the suffix added to all variables except those listed in &keys. */
/* Each &keys variable should be given in quotation marks, separated by spaces. */
%macro rename_list(inset, suffix, keys) ;
  %global rename_&inset ;  /* so that this macro variable is accessible outside the macro */
  proc sql ;
    select strip(name) || ' = ' || strip(name) || "_&suffix"
      into :rename_&inset separated by ' '
      from sashelp.vcolumn  /* dictionary.columns can be used in place of sashelp.vcolumn */
      where libname = 'WORK' and memname = "%upcase(&inset)"
        /* the ' ' is included so there is no error if no keys are given */
        and upcase(strip(name)) not in (' ' %upcase(&keys)) ;
  quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ;  /* check that the macro variables are correct */
%put &rename_males ;
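%macro pct_diff(num) ;
  percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
  /* data set options are separated by spaces, not commas */
  merge females(rename = (&rename_females) drop = sex)
        males  (rename = (&rename_males)   drop = sex) ;
  by group ;
  /* num is the part of the name after "Var": A and B for VarA and VarB */
  %pct_diff(A)
  %pct_diff(B)
run ;
dm 'vt difference';  /* open the result in a viewtable window */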
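The creation of the percent_diff variables can also be shortened with a macro (as shown). If a large and/or varying number of variables is being compared, and they are numbered sequentially (Var1, Var2, ...), it can be shortened further by detecting the number of comparisons automatically. Run the same SQL query with the select into part changed to
select count(name) into :varct trimmed
to count the variables, then generate the assignments with a macro %do loop inside the data step. A plain data step do loop does not work here: %pct_diff(i) is resolved when the step is compiled, so it would produce the literal name Vari_F rather than Var1_F, Var2_F and so on. The wrapper macro name all_pct_diffs is only illustrative:
%macro all_pct_diffs ;
  %local i ;
  %do i = 1 %to &varct ;
    %pct_diff(&i)
  %end ;
%mend all_pct_diffs ;
Call %all_pct_diffs in the data step in place of the individual %pct_diff calls.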
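For reference, the counting query might look like this in full. This is a sketch under assumptions: it takes the females dataset in the WORK library and the key columns GROUP and SEX from the earlier %rename_list calls, and the count is only useful for looping when the remaining variables are numbered Var1 to VarN.

```sas
/* Count the non-key columns of WORK.FEMALES into &varct.   */
/* Dataset and key names are taken from the example above.  */
proc sql noprint ;
  select count(name) into :varct trimmed
    from sashelp.vcolumn
    where libname = 'WORK' and memname = 'FEMALES'
      and upcase(strip(name)) not in ('GROUP', 'SEX') ;
quit ;
%put &=varct ;  /* writes the count to the log */
```

The trimmed keyword removes the leading blanks that would otherwise be stored in the macro variable.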
Answer 2 (score: 0)
Use table aliases in proc sql to avoid renaming anything:
proc sql;
  select a.group,
         (a.VarA-b.VarA)/((a.VarA+b.VarA)/2) as percent_diffA,
         (a.VarB-b.VarB)/((a.VarB+b.VarB)/2) as percent_diffB
  from females as a, males as b
  where a.group=b.group;
quit;