SAS: how to extract text from a string

Date: 2015-12-28 15:41:57

Tags: sas

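I have a list of degree programs that includes the degree type (for example, PhD), and I want to remove the degree type and keep only the program name. For example: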

Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 

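I am trying to use scan to split the string on "in" and extract all the text after it, but I don't understand the results I'm getting. When I use -1 (counting from the right) as the starting point, I get: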

data want; 
    format new_prog old_prog $200.; 
    set have (rename = (program = old_prog)); 
    if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in "); 
run; 


new_prog  old_prog
tecture   Master of Science in Architecture 
g         Master of Science in Sustainable Design 
cs        Master of Science in Building Performance and Diagnostics 
t         Master of Science in Architecture-Engineering and Construction Management 

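I didn't expect this to work, because I want the whole string after "in", not just the next word. But even when I use scan(old_prog, 2, "in"), which I expected to give me the next word, it seems to return random pieces, for example: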

program  old_prog 
Bu       PhD in Building Performance and Diagnostics 
of       Master of Science in Architecture-Engineering and Construction Management 
Computat PhD in Computational Design 
of       Master of Science in Sustainable Design 

4 answers:

Answer 0: (score: 1)

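data have;
input @1 old_prog $60.;
/* Keep the text that precedes the first " in"; otherwise keep the whole value */
if find(old_prog, ' in') then new_prog = substr(old_prog, 1, find(old_prog, ' in'));
else new_prog = old_prog;
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
proc print data = have;
run;

Obs  old_prog                                                   new_prog
1    Master of Science in Building Performance and Diagnostics  Master of Science
2    Master of Science in Computational Design                  Master of Science
3    Master of Science in Sustainable Design                    Master of Science
4    Master of Urban Design                                     Master of Urban Design
5    PhD in Architecture                                        PhD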

Answer 1: (score: 0)

Here is a way to do it using substr and index.

data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;

if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3); 

datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
;
run;

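index finds the position of "in" in the string and passes it to substr, which then returns the variable's value from that position + 3 through the end of the string.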
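As background on the question's results: the third argument of scan is a list of single-character delimiters, not a substring, so "in " splits the value at every i, every n, and every blank. The sketch below is a hedged variation on the substr approach rather than a definitive fix. It searches for " in " with blanks on both sides, so an "in" inside a word such as "Engineering" is not matched, and it keeps the original value when no delimiter is found. The dataset name demo, the helper variable pos, and the second data line are illustrative assumptions, not part of the original posts.

```sas
data demo;
    length old_prog new_prog $200;
    infile datalines truncover;
    input old_prog $200.;

    /* pos is a hypothetical helper: position of the first " in " (0 if absent) */
    pos = find(old_prog, " in ");

    /* " in " is 4 characters long, so the program name starts at pos + 4 */
    if pos > 0 then new_prog = substr(old_prog, pos + 4);
    else new_prog = old_prog;

    drop pos;
datalines;
Master of Science in Building Performance and Diagnostics
Master of Engineering in Architecture
Master of Urban Design
PhD in Architecture
;
run;
```

With these lines, the second row should yield "Architecture" instead of a fragment of "Engineering", and "Master of Urban Design" is left unchanged because it contains no " in ".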

Answer 2: (score: 0)

Consider data step and proc sql solutions that use the substr and index functions:

data want;
    set have;
    if count(old_prog, " in ") ge 1 
       then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;


proc sql;
    /* Read from have and end the procedure with quit */
    create table want as
    select *, 
    case when index(old_prog, "in") > 0 
         then substr(old_prog, index(old_prog, "in")+3)
         else old_prog
    end as new_prog
    from have;
quit;

Answer 3: (score: 0)

Others have already suggested many valid options. May I suggest a REGEX way to get what you want?

I noticed three patterns in your sample data:

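  1. The typical delimiter, which you tried to use in your code sample, is "in".
  2. When the typical delimiter is not used, another delimiter, "of", is used instead.
  3. The degree type can be spelled differently (Master, Master of Science, PhD).

REGEX is very useful when dealing with patterns in text, because you can define the text pattern to look for and extract the text when the pattern matches.

See the comments in the code for details: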

    /* Dropping pattern ids because they are not useful in data */
    data have (drop=pattern_in pattern_of);
        /* Reading in the raw data from datalines */
        input @1 old_prog $60.;
    
        /* Compiling first sample based on "in " pattern. */
        pattern_in = prxparse('/in ([\w\s]*)/');
    
        /* Compiling second sample based on "of " pattern. */
        pattern_of = prxparse('/of ([\w\s]*)/');
    
        /* If the string satisfies the pattern with "in " */
        if prxmatch(pattern_in,old_prog) then 
        /* Then extract capture buffer after "in " pattern */
        new_prog=prxposn(pattern_in,1,old_prog);
    
        /* If the string satisfies the pattern with "of " after pattern "in " was not found */
        else if prxmatch(pattern_of,old_prog) then 
        /* Then extract capture buffer after "of " pattern */
        new_prog=prxposn(pattern_of,1,old_prog);
        datalines;
    Master of Science in Building Performance and Diagnostics
    Master of Science in Computational Design 
    Master of Science in Sustainable Design 
    Master of Urban Design 
    PhD in Architecture 
    ;
    
    PROC PRINT DATA=have;
    run;
    

Result: (screenshot of the PROC PRINT output, image not preserved)
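One possible refinement of the regex idea, offered as a sketch and not as this answer's own method: the class [\w\s] does not match hyphens, so a name such as "Architecture-Engineering and Construction Management" from the question would be cut at the hyphen. The variant below uses \b so that "in" and "of" only match as standalone words, captures everything through the last non-blank character, and keeps the original value when neither delimiter is found. The dataset name demo_rx and the variables rx_in and rx_of are illustrative assumptions.

```sas
data demo_rx (drop=rx_in rx_of);
    length old_prog new_prog $200;
    infile datalines truncover;
    input old_prog $200.;

    /* \b keeps "in"/"of" from matching at the end of a longer word;  */
    /* (.*\S) captures through the last non-blank character           */
    rx_in = prxparse('/\bin\s+(.*\S)/');
    rx_of = prxparse('/\bof\s+(.*\S)/');

    /* Prefer the "in" delimiter, fall back to "of", else keep the value */
    if prxmatch(rx_in, old_prog) then new_prog = prxposn(rx_in, 1, old_prog);
    else if prxmatch(rx_of, old_prog) then new_prog = prxposn(rx_of, 1, old_prog);
    else new_prog = old_prog;
datalines;
Master of Science in Architecture-Engineering and Construction Management
Master of Urban Design
PhD in Architecture
;
run;
```

Trying "in" before "of" matters here: a value like "Master of Science in ..." contains both words, and only the "in" split yields the program name.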