Remove words from a field using words from a table in SAS

Time: 2019-04-08 11:30:16

Tags: sql sas

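I am trying to remove words from a field in SAS using a table of words.

Using some code I found online, I have been able to isolate each word, but I cannot remove the word from the field.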

For example, if the field is:

"the fox jumped over the moon"

and the word "jumped" is in the word list, the result should look like:

"the fox over the moon"

This is the table of stop words to remove:

PROC SQL;
   CREATE TABLE BOW.QUERY_FOR_STOPWORDS AS 
   SELECT t1.StopWords
      FROM BOW.STOPWORDS t1;
QUIT;

This is the table with the field the words should be removed from:

PROC SQL;
   CREATE TABLE WORK.QUERY_FOR_ANNU_COMMENTS AS 
   SELECT t1.Comment
      FROM BOW.ANNU_COMMENTS t1;
QUIT;

4 answers:

Answer 0 (score: 1)

Depending on how many stop words you have, other solutions may suit you better. This one uses CALL EXECUTE:

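data _NULL_;
    set STOPWORDS end=e;
    /* open the generated step once, before the first stop word */
    if _N_=1 then call execute('data result;set ANNU_COMMENTS;newComment=Comment;');
    /* compile one regex per stop word; the sum statement retains its id across rows */
    call execute('if _N_=1 then __'||put(_N_,z30.)||'+prxparse("s/'||trimn(StopWords)||'//");');
    /* -1 replaces every match in newComment */
    call execute('call prxchange(__'||put(_N_,z30.)||',-1,newComment);');   
    /* close the generated step after the last stop word */
    if e then call execute('drop __:;run;');
run;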

This reads the stop words and generates a data step from them; the generated data step then processes the comments.

EDIT: To remove only whole words, wrap each stop word in the word-boundary token \b in the regex.

data _NULL_;
    set STOPWORDS end=e;
    if _N_=1 then call execute('data result;set ANNU_COMMENTS;newComment=Comment;');
    call execute('if _N_=1 then __'||put(_N_,z30.)||'+prxparse("s/\b'||trimn(StopWords)||'\b//");');
    call execute('call prxchange(__'||put(_N_,z30.)||',-1,newComment);');   
    if e then call execute('drop __:;run;');
run;
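
To make the code generation concrete, here is a sketch of the step that CALL EXECUTE would queue if STOPWORDS held a single row with the value jumped (that row is an assumption for illustration). With more rows, the IF and CALL PRXCHANGE pair repeats once per stop word with a different numeric suffix.

```sas
/* Generated step, assuming STOPWORDS contains only the word "jumped" */
data result;
   set ANNU_COMMENTS;
   newComment=Comment;
   /* put(1,z30.) yields a 30-digit suffix, so the name is exactly 32 characters */
   if _N_=1 then __000000000000000000000000000001+prxparse("s/\bjumped\b//");
   call prxchange(__000000000000000000000000000001,-1,newComment);
   drop __:;
run;
```

Two caveats follow from this expansion. The blanks on either side of a removed word stay behind, so a follow-up such as newComment=compbl(newComment); may be wanted. Stop words that contain regex metacharacters would also need escaping before being placed in the pattern.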

Answer 1 (score: 0)

The basic idea is replace():

SELECT REPLACE(t1.Comment, 'jumped', '')
FROM BOW.ANNU_COMMENTS t1;

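However, that leaves you with a problem around spaces. If that matters and you want to match whole words only, this might work (the match is replaced with a single space so that the neighbouring words are not joined):

SELECT TRIM(BOTH ' ' FROM REPLACE(' ' || t1.Comment || ' ', ' jumped ', ' '))
FROM BOW.ANNU_COMMENTS t1;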
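
The queries above are written in generic SQL. As far as I know, SAS PROC SQL has no REPLACE function and does not accept the TRIM(BOTH ... FROM ...) form, so a native SAS version would likely use TRANWRD and STRIP instead. A minimal sketch, where the output table name and the LENGTH= value are assumptions to adjust:

```sas
proc sql;
   create table WORK.COMMENTS_CLEANED as       /* hypothetical output table name */
   select t1.Comment,
          /* pad with blanks so the word also matches at the start or end of the text */
          strip(tranwrd(' ' || strip(t1.Comment) || ' ', ' jumped ', ' '))
             as newComment length=1000          /* assumed width; match Comment's length */
      from BOW.ANNU_COMMENTS t1;
quit;
```

Like the original query, this handles one hard-coded word and is case-sensitive; it only illustrates the function mapping.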

Answer 2 (score: 0)

You could try a hash iterator, for example:

data want;
   if 0 then set STOPWORDS;              /* define StopWords in the PDV for the hash */
   if _n_=1 then do;
      declare hash h(dataset:'STOPWORDS');
      declare hiter iter('h');
      h.definekey('StopWords');
      h.definedata('StopWords');
      h.definedone();
   end;
   set ANNU_COMMENTS;
   newComment=Comment;                   /* start from the original text */
   rc=iter.first();
   do while(rc=0);
      /* findw tests for a whole word; tranwrd then replaces every occurrence of that text */
      if findw(newComment,strip(StopWords))>0 then
         newComment=tranwrd(newComment,strip(StopWords),'');
      rc=iter.next();
   end;
   drop StopWords rc;
run;

Answer 3 (score: 0)

You can write a macro that generates the cleaning data step.

This example blends Shenglin's use of tranwrd with Lee's code generation.

%macro flense (
  data=Commments,
  var=Comment, 
  newvar=CommentFlensed, 
  censor=BOW.StopWordsList,
  term=StopWords,
  out=Flensed   /* output data set; the original never defined it, so this default name is an assumption */
);

  proc sql noprint;
    /* single quote the words to prevent possible macro evaluation later */
    select quote(trim(&term),"'") into :sq_word1-
    from &censor;
  quit;

  data &out;
    set &data;
    &newvar = &var;

    %* for each censored word, generate an if statement
     * that checks if the word (or term) is present, and if so
     * removes the word from the new variable;

    %local i quoted_word;
    %do i = 1 %to &SQLOBS;

      %let quoted_word = &&sq_word&i;

      if (indexw(&newvar.,&quoted_word)) then 
        &newvar = tranwrd(&newvar,&quoted_word,'');

    %end;
  run;

%mend;

%flense();