Question

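I have a list of degree programs whose names include the degree type (e.g. PhD). I want to remove the degree type and keep only the program name. For example: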

Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture

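I am trying to use scan to split the string on "in" and extract all the text after it, but I don't understand the results I am getting. When I use -1 (counting from the right) as the word number, I get: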

data want; 
    format new_prog old_prog $200.; 
    set have (rename = (program = old_prog)); 
    if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in "); 
run; 


new_prog  old_prog
tecture   Master of Science in Architecture 
g         Master of Science in Sustainable Design 
cs        Master of Science in Building Performance and Diagnostics 
t         Master of Science in Architecture-Engineering and Construction Management

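I didn't expect this to work, because I want the entire string after "in", not just the next word. But even scan(old_prog, 2, "in "), which I expected to return the next word, seems to give me random pieces, for example: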

program  old_prog 
Bu       PhD in Building Performance and Diagnostics 
of       Master of Science in Architecture-Engineering and Construction Management 
Computat PhD in Computational Design 
of       Master of Science in Sustainable Design

Answer 1

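data have;
    input @1 old_prog $60.;
    if find(old_prog, ' in ') then new_prog = substr(old_prog, 1, find(old_prog, ' in '));
    else new_prog = old_prog;
    datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
proc print data=have;
run;

Obs  old_prog                                                   new_prog
1    Master of Science in Building Performance and Diagnostics  Master of Science
2    Master of Science in Computational Design                  Master of Science
3    Master of Science in Sustainable Design                    Master of Science
4    Master of Urban Design                                     Master of Urban Design
5    PhD in Architecture                                        PhD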

Answer 2

Here is a way to do it using substr and index.

data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;

if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3); 

datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
;
run;

index finds the position of "in" in the string and passes it to substr, which then returns everything from that position + 3 through the end of the string.
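The same idea can be made a little more defensive. The following is only a sketch, not part of the answer above: it assumes the separator is always the standalone word "in" with a blank on each side, and that values without it should be kept unchanged. The variable pos is a name introduced here for illustration.

```sas
data want;
    length old_prog new_prog $200;
    infile datalines truncover;
    input old_prog $200.;

    /* Position of the standalone word " in " (blank on both sides); 0 if absent. */
    /* Searching with the blanks avoids matching "in" inside "Engineering".       */
    pos = find(old_prog, " in ");

    /* " in " is 4 characters long, so the program name starts at pos + 4. */
    if pos > 0 then new_prog = substr(old_prog, pos + 4);
    /* Assumption: keep the original value when there is no " in ". */
    else new_prog = old_prog;

    drop pos;
    datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Architecture-Engineering and Construction Management
Master of Urban Design
PhD in Architecture
;
run;
```

This also avoids the surprise with scan in the question: scan treats every character of its third argument as a separate delimiter, so "in " splits on i, n and blank rather than on the word "in".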

Answer 3

Consider a data step solution and a proc sql solution, both using the substr and index functions:

data want;
    set have;
    if count(old_prog, " in ") ge 1 
       then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;


proc sql;
    create table want as
    select *, 
    case when index(old_prog, "in") > 0 
         then substr(old_prog, index(old_prog, "in")+3)
         else old_prog
    end as new_prog
    from have;
quit;

Answer 4

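Others have already suggested many valid options. May I suggest a regex way to get what you want?

I noticed three patterns in your sample data:

The typical separator is "in", which is what you tried to use in your code example.
When the typical separator is not used, another separator, "of", is used instead.
The degree type can be spelled in different ways (Master, Master of Science, PhD).

Regex is very useful for working with patterns in text, because you can define the text pattern you are looking for and extract the text when the pattern matches.

See the comments in the code for details: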

/* Dropping pattern ids because they are not useful in data */
data have (drop=pattern_in pattern_of);
    /* Reading in the raw data from datalines */
    input @1 old_prog $60.;

    /* Compile the first pattern: capture the text after "in " */
    pattern_in = prxparse('/in ([\w\s]*)/');

    /* Compile the second pattern: capture the text after "of " */
    pattern_of = prxparse('/of ([\w\s]*)/');

    /* If the string matches the "in " pattern, */
    if prxmatch(pattern_in,old_prog) then 
    /* extract the capture buffer that follows "in " */
    new_prog=prxposn(pattern_in,1,old_prog);

    /* Otherwise, if the string matches the "of " pattern, */
    else if prxmatch(pattern_of,old_prog) then 
    /* extract the capture buffer that follows "of " */
    new_prog=prxposn(pattern_of,1,old_prog);
    datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design 
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
;

PROC PRINT DATA=have;
run;
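One limitation of [\w\s]* is that it stops at the first character that is neither a word character nor whitespace, so a name from the question such as "Architecture-Engineering and Construction Management" would be cut off at the hyphen. A possible variant is sketched below. It is not part of the answer above, and it assumes that "in" and "of" always appear as whole words and that values matching neither should be kept unchanged. The names re_in and re_of are introduced here for illustration.

```sas
data want (drop=re_in re_of);
    length old_prog new_prog $200;
    retain re_in re_of;
    infile datalines truncover;
    input old_prog $200.;

    /* Compile both patterns once, on the first iteration.          */
    /* \b requires "in" / "of" to be whole words; (.+) captures the */
    /* rest of the value, including hyphens and other punctuation.  */
    if _n_ = 1 then do;
        re_in = prxparse('/\bin\s+(.+)/');
        re_of = prxparse('/\bof\s+(.+)/');
    end;

    /* Prefer the "in" separator; fall back to "of" only if "in" is absent. */
    if prxmatch(re_in, old_prog) then new_prog = prxposn(re_in, 1, old_prog);
    else if prxmatch(re_of, old_prog) then new_prog = prxposn(re_of, 1, old_prog);
    /* Assumption: keep the original value when neither separator is found. */
    else new_prog = old_prog;
    datalines;
Master of Science in Architecture-Engineering and Construction Management
Master of Urban Design
PhD in Architecture
;
run;
```

The capture also picks up the trailing blanks that pad old_prog, which is harmless because SAS character variables are blank-padded anyway.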
