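I have a string that contains one word written entirely in capital letters. I want to extract that one word into a new variable using SAS.
I think I need a way to match a word made of two or more capital letters (two or more, because the first word of a sentence also starts with a capital letter).
That is, how do I create the variable 'word':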
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
I hope someone can help and that the question is clear.
Thanks in advance.
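Answer 0: (score: 0)
Consider a regex match/replace that uses a negative lookahead, (?!...), to delete every character that does not begin one of these two kinds of match: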
(([A-Z ]){2,})
(([A-Z.]){2,})
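CAVEAT: This solution works, but the word I is also matched. Technically that is a valid match, since it is also an all-caps word, just one letter long. As it is the only word of its kind in English, consider using tranwrd()
to replace this special case. Relatedly, note that this solution matches every all-caps word in the string, not only one.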
data example;
length txtString $50;
length word $20;
infile datalines dlm=',';
input txtString $ word $;
datalines;
This is one EXAMPLE. Of what I need.,EXAMPLE
THIS is another.,THIS
etc ETC,ETC
;
run;
data example;
set example;
pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
wordextract = prxchange(pattern_num, -1, txtString);
wordextract = tranwrd(wordextract, " I ", "");
drop pattern_num;
run;
txtString                               word     wordextract
This is one EXAMPLE. Of what I need.    EXAMPLE  EXAMPLE
THIS is another.                        THIS     THIS
etc ETC                                 ETC      ETC
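Because the substitution keeps any blank that sits directly in front of a capital letter, the extracted value may carry leading or doubled blanks that a listing does not show. The following is a minimal sketch, assuming the same example dataset and the same pattern as above, that tidies the result with compbl() and strip(). The dataset name tidy is hypothetical.

```sas
data tidy;
    set example;
    /* Same substitution as above: delete every character that does not
       start a run of two or more characters from [A-Z ] or [A-Z.] */
    pattern_num = prxparse("s/(?!(([A-Z ]){2,})|(([A-Z.]){2,})).//");
    wordextract = prxchange(pattern_num, -1, txtString);
    /* Special case from the caveat: drop a standalone I */
    wordextract = tranwrd(wordextract, " I ", " ");
    /* compbl() collapses runs of blanks to a single blank;
       strip() removes leading and trailing blanks */
    wordextract = strip(compbl(wordextract));
    drop pattern_num;
run;
```

This does not change which words are matched; it only normalizes the spacing around them.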
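Answer 1: (score: 0)
SAS has a prxsubstr() call routine that finds the starting position and length of the substring matching a given regex pattern in a given string. Here is an example solution using the prxsubstr() call routine: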
data solution;
set example;
/* Build a regex pattern of the word to search for, and hang on to it */
/* (The regex below means: word boundary, then two or more capital letters,
then word boundary. Word boundary here means the start or the end of a string
of letters, digits and/or underscores.) */
if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");
retain pattern_num;
/* Get the starting position and the length of the word to extract */
call prxsubstr(pattern_num, txtString, mypos, mylength);
/* If a word matching the regex pattern is found, extract it */
if mypos ^= 0 then word = substr(txtString, mypos, mylength);
run;
SAS prxsubstr() documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002295971.htm
Regex word boundary information: http://www.regular-expressions.info/wordboundaries.html
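Note that prxsubstr() reports only the first match. If a string could contain more than one all-caps word, one option is to loop with the CALL PRXNEXT routine instead. The following is a minimal sketch under that assumption, reusing the same pattern and the example dataset. The dataset name all_caps_words and the variable all_words are hypothetical names chosen for illustration.

```sas
data all_caps_words;
    set example(keep=txtString);
    length all_words $50;   /* hypothetical variable: every match, blank-separated */
    retain pattern_num;
    /* Compile the pattern once and reuse it for every row */
    if _N_ = 1 then pattern_num = prxparse("/\b[A-Z]{2,}\b/");

    /* Search window: the whole string, ignoring trailing blanks */
    start = 1;
    stop = length(txtString);

    /* PRXNEXT returns the next match and advances start past it;
       mypos is 0 once no further match is found */
    call prxnext(pattern_num, start, stop, txtString, mypos, mylength);
    do while (mypos > 0);
        all_words = catx(' ', all_words, substr(txtString, mypos, mylength));
        call prxnext(pattern_num, start, stop, txtString, mypos, mylength);
    end;

    drop pattern_num start stop mypos mylength;
run;
```

Because the pattern requires at least two capital letters between word boundaries, a standalone I is skipped without any extra handling.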