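I'm looking for a better way in SAS to count how many times a word appears in a string. For example, searching for 'wood' in the string:
how much wood could a woodchuck chuck if a woodchuck could chuck wood
...would return 2.
This is how I would normally do it, but it's a lot of code: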
data _null_;
length sentence word $200;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
search_term = 'wood';
found_count = 0;
cnt=1;
word = scan(sentence,cnt);
do while (word ne '');
num_times_found = sum(num_times_found, word eq search_term);
cnt = cnt + 1;
word = scan(sentence,cnt);
end;
put num_times_found=;
run;
I could put it into an fcmp function to make it more elegant, but I still feel there must be friendlier and more concise code.
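Answer 0 (score: 3)
From a code review standpoint, the above can be improved somewhat. The do loop can handle the cnt increment, and if you switch it to until, you don't even need the initial assignment. You also have an extraneous variable, found_count, which is initialized but never used; not sure what that's for. Otherwise, I think this is reasonable, at least for an uncomplicated solution.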
data _null_;
length sentence word $200;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
search_term = 'wood';
do cnt=1 by 1 until (word eq '');
word = scan(sentence,cnt);
num_times_found = sum(num_times_found, word eq search_term);
end;
put num_times_found=;
run;
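It's also quite fast: 1e6 iterations take under 9 seconds on my box. The PRX solution, with the o option added to the regex string, takes less time (6 seconds), so it may be better with very large datasets or a large number of variables, but I doubt the time difference would be significant compared with I/O time. The FCMP solution is in the same range as this one (both around 8-9 seconds). Finally, the FINDW solution is the fastest, at around 2 seconds.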
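For anyone who wants to repeat the comparison, a timing harness along these lines should work. This is only a sketch of one way to measure it, not necessarily how the numbers above were produced; the variable names iter, t0 and elapsed are made up for illustration, and the timings will differ by machine.

```sas
data _null_;
  length sentence word $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  t0 = datetime();                       /* wall-clock start (assumed timing method) */
  do iter = 1 to 1e6;                    /* repeat the count 1e6 times */
    num_times_found = 0;                 /* reset the counter for each repetition */
    do cnt = 1 by 1 until (word eq '');
      word = scan(sentence, cnt);
      num_times_found = sum(num_times_found, word eq search_term);
    end;
  end;
  elapsed = datetime() - t0;             /* elapsed seconds */
  put num_times_found= elapsed=;
run;
```

Swapping the inner loop for the FINDW or PRX logic gives a like-for-like comparison on the same sentence.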
Answer 1 (score: 3)
There's no reason to scan through all the words yourself when FINDW effectively does the scanning for you.
data _null_;
length sentence search_term $200;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
search_term = 'wood';
cnt=0;
do s=findw(sentence,strip(search_term),1) by 0 while(s);
cnt+1;
s=findw(sentence,strip(search_term),s+1);
end;
put cnt= search_term=;
stop;
run;
cnt=2 search_term=wood
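Answer 2 (score: 2)
Try stripping out wood with prxchange, then count the words again; the difference between the two word counts is the number of occurrences.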
data _null_;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
count=countw(sentence,' ')-countw(prxchange('s/\bwood\b//i',-1,sentence),' ');
put _all_;
run;
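If you would rather count the matches directly instead of comparing word counts, a loop over CALL PRXNEXT is another option. The following is a minimal sketch, not part of the answer above; it assumes that whole-word, case-insensitive matching is what you want, and the variable names re, pos and len are just illustrative.

```sas
data _null_;
  length sentence $200;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  /* \b restricts the match to whole words; i makes it case-insensitive */
  re = prxparse('/\bwood\b/i');
  start = 1;
  stop = length(sentence);
  count = 0;
  /* PRXNEXT returns pos = 0 when there are no more matches,
     and moves start past each match it finds */
  call prxnext(re, start, stop, sentence, pos, len);
  do while (pos > 0);
    count + 1;
    call prxnext(re, start, stop, sentence, pos, len);
  end;
  put count=;
run;
```

With the sample sentence this should report count=2, because the word boundaries keep woodchuck from matching.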
Answer 3 (score: 2)
For completeness, here it is as an fcmp function:
FCMP definition:
options cmplib=work.temp.temp;
proc fcmp outlib=work.temp.temp;
function word_freq(sentence $, search_term $) ;
length sentence word $200;
do cnt=1 by 1 until (word eq '');
word = scan(sentence,cnt);
num_times_found = sum(num_times_found, word eq search_term);
end;
return (num_times_found);
endsub;
run;
Usage:
data _null_;
num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood');
put num_times_found=;
run;
Result:
num_times_found=2