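Counting the number of times a word occurs

Time: 2016-02-12 16:11:12

Tags: sas

I'm looking for a better way in SAS to count the number of times a word appears in a string. For example, searching for 'wood' in the string: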

how much wood could a woodchuck chuck if a woodchuck could chuck wood

...would return 2.

This is how I would normally do it, but it takes a lot of code:

data _null_;
  length sentence word $200;

  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';
  found_count = 0;

  cnt=1;
  word = scan(sentence,cnt);
  do while (word ne '');
    num_times_found = sum(num_times_found, word eq search_term);
    cnt = cnt + 1;
    word = scan(sentence,cnt);
  end;

  put num_times_found=;

run;

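I could put it into an fcmp function to make it more elegant, but I still feel there must be friendlier and more concise code.

4 Answers:

Answer 0: (score: 3)

From a Code Review standpoint, the above can be improved somewhat. The do loop can handle the cnt increment, and if you switch it to until, you don't even have to do the initial assignment. You also have an extraneous variable, found_count; not sure what that is for. Otherwise, I think it's reasonable, at least for an uncomplicated solution.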

data _null_;
  length sentence word $200;

  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';

  do cnt=1 by 1 until (word eq '');
    word = scan(sentence,cnt);
    num_times_found = sum(num_times_found, word eq search_term);
  end;

  put num_times_found=;

run;

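It is also quite fast: 1e6 iterations take under 9 seconds on my box. The PRX solution takes less time (6 seconds) once the o option is added to the regex string, so it may be the better choice with very large datasets or a large number of variables, but I doubt the extra time will be significant compared to I/O time. The FCMP solution takes about the same time as this one (both around 8-9 seconds). Finally, the FINDW solution is the fastest, at about 2 seconds.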
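A rough sketch of how timings like these could be reproduced, assuming a simple harness that repeats the SCAN loop 1e6 times inside one DATA step and measures wall-clock time with datetime(). The names iter, start and elapsed are made up for illustration, and this is not necessarily how the figures above were obtained; absolute numbers will vary by machine.

```sas
data _null_;
  length sentence word $200;

  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
  search_term = 'wood';

  start = datetime();                 /* wall-clock start, in seconds */

  do iter = 1 to 1e6;                 /* hypothetical repetition count */
    num_times_found = 0;              /* reset the tally for each repetition */
    do cnt = 1 by 1 until (word eq '');
      word = scan(sentence, cnt);
      num_times_found = sum(num_times_found, word eq search_term);
    end;
  end;

  elapsed = datetime() - start;       /* elapsed seconds for all repetitions */

  put num_times_found= elapsed= 8.2;
run;
```

Swapping the inner loop for one of the other approaches in this thread would give a like-for-like comparison under the same assumptions.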

Answer 1: (score: 3)

There is no reason to scan all the words yourself when FINDW effectively does the scanning for you.

data _null_;
   length sentence search_term $200;
   sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
   search_term = 'wood';
   cnt=0;
   do s=findw(sentence,strip(search_term),1) by 0 while(s);
      cnt+1;
      s=findw(sentence,strip(search_term),s+1);
      end;
   put cnt= search_term=;
   stop;
   run;

cnt=2 search_term=wood
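To illustrate why a whole-word function such as FINDW is needed here rather than a plain substring count, the following minimal sketch contrasts the two on the same sentence. The variable names substring_hits and first_word_pos are made up for illustration.

```sas
data _null_;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

  /* COUNT matches substrings, so both occurrences of "woodchuck" are included: 4 hits */
  substring_hits = count(sentence, 'wood');

  /* FINDW only matches "wood" when it is delimited as a separate word,
     and returns the character position of the first such match (0 if none) */
  first_word_pos = findw(sentence, 'wood');

  put substring_hits= first_word_pos=;
run;
```

Because FINDW returns 0 when no further match exists, it can drive the while(s) loop above directly.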

Answer 2: (score: 2)

Try removing 'wood' with prxchange, then count the words again.

data _null_;
sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';
count=countw(sentence,' ')-countw(prxchange('s/wood/$1/i',-1,sentence),' ');
put _all_;
run;
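A hedged variation on the same idea, not taken from the answer itself: the pattern below adds \b word boundaries so that only a standalone "wood" is removed, and appends the o option so the expression is compiled only once. Both changes are assumptions about what a stricter version might look like.

```sas
data _null_;
  sentence = 'how much wood could a woodchuck chuck if a woodchuck could chuck wood';

  /* Remove only whole-word matches of "wood"; "woodchuck" is left untouched.
     i = case-insensitive, o = compile the expression once. */
  stripped = prxchange('s/\bwood\b//io', -1, sentence);

  /* Words before removal minus words after removal = whole-word matches */
  count = countw(sentence, ' ') - countw(stripped, ' ');

  put count=;
run;
```

For the sample sentence this gives count=2, since the word count drops from 13 to 11.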

Answer 3: (score: 2)

For completeness, here it is as an fcmp function:

FCMP definition:

options cmplib=work.temp.temp;

proc fcmp outlib=work.temp.temp;

  function word_freq(sentence $, search_term $) ;    
    length sentence word $200;

    do cnt=1 by 1 until (word eq '');
      word = scan(sentence,cnt);
      num_times_found = sum(num_times_found, word eq search_term);
    end;

    return (num_times_found);
  endsub;

run;

Usage:

data _null_;
  num_times_found = word_freq('how much wood could a woodchuck chuck if a woodchuck could chuck wood','wood');
  put num_times_found=;
run;

Result:

num_times_found=2