SAS: find an uppercase word in a string

Time: 2016-08-10 23:47:11

Tags: string sas

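I have a string that contains one word written in uppercase letters. I want to use SAS to extract that one word into a new variable.

I think I need a way to match a word containing two or more uppercase letters (since the first word of a sentence also starts with an uppercase letter).

I.e., how do I create the variable 'word':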

data example;
    length txtString $50;
    length word $20;
    infile datalines dlm=',';
    input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;

I hope someone can help and that the question is clear.

Thanks in advance.

2 answers:

Answer 0 (score: 0)

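Consider a regex match/replace that uses a negative lookahead, (?!...), to keep two types of matches:

  1. Consecutive uppercase letters followed by a space, at least two characters long (to avoid the title-case word at the start of a sentence): (([A-Z ]){2,})
  2. Consecutive uppercase letters followed by a period, at least two characters long (again to avoid the title-case word at the start of a sentence): (([A-Z.]){2,})
  3. CAVEAT: this solution works, but the word "I" also matches. Technically that is a valid match, since it is a one-letter word written entirely in uppercase. As it is the only word of its kind in English, consider replacing this special case with tranwrd(). On a related note, this solution matches every uppercase word in the string, not just the first.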

    data example;
        length txtString $50;
        length word $20;
        infile datalines dlm=',';
        input txtString $ word $;
    datalines;
    This is one EXAMPLE. Of what I need.,EXAMPLE
    THIS is another.,THIS
    etc ETC,ETC
    ;
    run;
    
    data example;
        set example;
        pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
        wordextract = prxchange(pattern_num, -1, txtString); 
    
        wordextract = tranwrd(wordextract, " I ", "");
        drop pattern_num;
    run;
    
    txtString                               word     wordextract
    This is one EXAMPLE. Of what I need.    EXAMPLE  EXAMPLE
    THIS is another.                        THIS     THIS
    etc ETC                                 ETC      ETC
    

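Answer 1 (score: 0)

SAS has a prxsubstr() call routine that finds the starting position and the length of the substring that matches a given regex pattern in a given string. Here is an example solution that uses call prxsubstr():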

data solution;
    set example;

    /* Build a regex pattern of the word to search for, and hang on to it */
    /* (The regex below means: word boundary, then two or more capital letters, 
    then word boundary. Word boundary here means the start or the end of a string
    of letters, digits and/or underscores.) */
    if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
    retain pattern_num;

    /* Get the starting position and the length of the word to extract */
    call prxsubstr(pattern_num, txtString, mypos, mylength);

    /* If a word matching the regex pattern is found, extract it */
    if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
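
A possible variation, offered as a sketch rather than as part of the answer above: wrap the word in a capture group and read it back with prxposn(), so no position and length variables are needed. The names solution2 and re are made up for illustration, and the input is assumed to be the example data set from the question.

```sas
data solution2;
    /* Drop the hand-typed answer column so word is rebuilt from txtString */
    set example(drop=word);
    length word $20;

    /* Compile once; the parentheses form capture buffer 1 */
    retain re;
    if _N_ = 1 then re = prxparse("/\b([A-Z]{2,})\b/");

    /* prxmatch() returns 0 when nothing matches, so word stays blank */
    if prxmatch(re, txtString) then word = prxposn(re, 1, txtString);

    drop re;
run;
```

Like call prxsubstr(), this returns only the first match in each string.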

SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm

Regex word boundary information: http://www.regular-expressions.info/wordboundaries.html
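
If a string could hold more than one uppercase word (the question assumes exactly one), a loop over call prxnext() is one way to collect all of them. This is only a sketch under that assumption: all_caps, allWords, re, start, stop, pos and len are illustrative names, not something from the question.

```sas
data all_caps;
    set example(keep=txtString);
    length allWords $50;

    /* Compile the same two-or-more-capitals pattern once */
    retain re;
    if _N_ = 1 then re = prxparse("/\b[A-Z]{2,}\b/");

    /* Scan the whole string; call prxnext() advances start after each match */
    start = 1;
    stop = length(txtString);
    call prxnext(re, start, stop, txtString, pos, len);
    do while (pos > 0);
        /* Append each match, separated by a single blank */
        allWords = catx(' ', allWords, substr(txtString, pos, len));
        call prxnext(re, start, stop, txtString, pos, len);
    end;

    drop re start stop pos len;
run;
```

Because the pattern requires at least two capitals between word boundaries, a lone "I" is skipped without a separate tranwrd() step.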