Question

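I'm searching through medical notes to capture all instances of a phrase, specifically "carbapenemase producing". Sometimes this phrase occurs more than once within a string. After some research, I think PRXNEXT makes the most sense, but I'm having a hard time getting it to do what I want. As an example, take this string: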

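if amikacin results are needed please notify microbiology lab at ext for further testing the organism will be held until meropenem result obtained by disc diffusion presumptive carbapenemase producing cre see spmi for carba r pcr results not confirmed carbapenemase producing cre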

From the comment above, I'd like to extract the phrases

presumptive carbapenemase producing

and

not confirmed carbapenemase producing

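I realize I may not be able to extract those exact phrases, but I'd like to pull out some phrase that contains the substring. The code I've been using is adapted from an example I found. Here is what I have so far, but it only captures the first phrase: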

carba_cnt = count(as_comments,'carba','i');

if _n_ = 1 then do;
  retain reg1 neg1;
  reg1 = prxparse("/ca[bepr]\w+ prod/");
end;

start = 1;
stop = length(as_comments);
position = 0;
length = 0;

/* Use PRXNEXT to find the first instance of the pattern, */
/* then use DO WHILE to find all further instances.       */
/* PRXNEXT changes the start parameter so that searching  */
/* begins again after the last match.                     */

call prxnext(reg1, start, stop, as_comments, position, length);

lastpos = 0;
do while (position > 0);
  if lastpos then do;
    length found $200;
    found = substr(as_comments,lastpos,position-lastpos);
    put found=;
    output;
  end;
  lastpos = position;

  call prxnext(reg1, start, stop, as_comments, position, length);
end;

if lastpos then do;
  found = substr(as_comments,lastpos);
  put found=;
  output;
end;

Answer 1

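You are correct to use PRXNEXT to locate each occurrence of a regex match in your source. The regex pattern can be modified to use group capture so that it also picks up an optional leading "not confirmed". The approach with the least chance of "coder fail" is to build the loop and the extraction around a single call to PRXNEXT.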

This sample uses the pattern /((not confirmed\s*)?(ca[bepr]\w+ prod))/ and outputs one row per match.

data have;
  id + 1;
  length comment $2000;
  infile datalines eof=done;
  do until (_infile_ = '----');
    input;
    if _infile_ ne '----' then 
      comment = catx(' ',comment,_infile_);
  end;
  done:
  if not missing(comment);
  datalines4;
if amikacin results are needed please notify microbiology lab at ext 
for further testing the organism will be held until meropenem result 
obtained by disc diffusion presumptive carbapenemase producing cre 
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
if amikacin results are needed please notify microbiology lab at ext 
for further testing the organism will be held until meropenem result 
obtained by disc diffusion conjectured carbapenems producing cre 
see spmi for carba r pcr results not confirmed carbapenemase producing cre
----
;;;;
run;

data want;
  set have;
  prx = prxparse('/((not confirmed\s*)?(ca[bepr]\w+ prod))/');

  _start_inout = 1;

  do hitnum = 1 by 1 until (pos=0);
    call prxnext (prx, _start_inout, length(comment), comment, pos, len);
    if len then do;
      content = substr(comment,pos,len);
      output;
    end;
  end;

  keep id hitnum content;
run;
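
The question also asks for the word in front of the match (for example "presumptive"). A minimal sketch of one way to get that, assuming the HAVE data set above: the pattern is widened to take either "not confirmed" or any single preceding word, and to run to the end of "producing". The data set name want_context and the $200 length are illustrative choices, not part of the original answer.

```sas
data want_context;
  set have;
  length content $200;   /* assumed to be long enough for one phrase */

  /* Constant pattern, so it is compiled only once.                 */
  /* Group 1 is either "not confirmed" or the single word that      */
  /* precedes the carbapenemase term; \w* extends to "producing".   */
  prx = prxparse('/(not confirmed|\w+)\s+ca[bepr]\w+ prod\w*/');

  start = 1;
  stop  = length(comment);

  /* First call finds the first match; each later call resumes      */
  /* after the previous match because PRXNEXT updates START.        */
  call prxnext(prx, start, stop, comment, pos, len);
  do hitnum = 1 by 1 while (pos > 0);
    content = substr(comment, pos, len);
    output;
    call prxnext(prx, start, stop, comment, pos, len);
  end;

  keep id hitnum content;
run;
```

With the first sample comment this should give two rows, one beginning with "presumptive" and one beginning with "not confirmed". Only a single preceding word is captured, so longer qualifiers would need their own alternatives in the pattern.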

Bonus info: prxparse does not need to be inside an if _n_=1 block. See the PRXPARSE docs:

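If perl-regular-expression is a constant or if it uses the /o option, then the Perl regular expression is compiled only once. Successive calls to PRXPARSE will not cause a recompile, but will return the regular-expression-id for the regular expression that was already compiled. This behavior simplifies the code because you do not need to use an initialization block (IF _N_ = 1) to initialize Perl regular expressions.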

Using PRXNEXT to capture all instances of a keyword
