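I have a list of degree programs that includes the degree type (for example, PhD), and I want to remove the degree type and keep only the program name. For example: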
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
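I am trying to use scan to split the string on "in" and extract all the text after it, but I don't understand the results I get. When I use -1 (counting from the right) as the starting point, I get: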
data want;
format new_prog old_prog $200.;
set have (rename = (program = old_prog));
if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in ");
run;
new_prog  old_prog
tecture   Master of Science in Architecture
g         Master of Science in Sustainable Design
cs        Master of Science in Building Performance and Diagnostics
t         Master of Science in Architecture-Engineering and Construction Management
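I didn't expect this to work anyway, because I want the entire string after "in", not just the next word. But even when I use scan(old_prog, 2, "in"), which I expected to give me the next word, it seems to return random fragments, for example: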
program   old_prog
Bu        PhD in Building Performance and Diagnostics
of        Master of Science in Architecture-Engineering and Construction Management
Computat  PhD in Computational Design
of        Master of Science in Sustainable Design
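The fragments above are consistent with how scan interprets its third argument: as a list of single-character delimiters, not as one multi-character separator. With "in " the string is split at every i, every n, and every blank. A minimal sketch of that behavior, using one of the rows shown above (the variable names last_piece and last_word are illustrative only):

```sas
data _null_;
    length old_prog $200;
    old_prog = "Master of Science in Architecture";
    /* 'i', 'n' and blank each act as a delimiter, so the last  */
    /* "word" is whatever follows the final i, n, or blank      */
    last_piece = scan(old_prog, -1, "in ");
    /* With a blank as the only delimiter, scan splits on words */
    last_word  = scan(old_prog, -1, " ");
    put last_piece= last_word=;
run;
```

The first call reproduces the "tecture" result from the table above, because the last i in "Architecture" is treated as a delimiter.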
Answer 0 (score: 1)
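data have;
input @1 old_prog $60.;
/* Keep the text up to the first ' in' (the degree part); */
/* rows without ' in' are copied unchanged */
if find(old_prog, ' in') then new_prog = substr(old_prog, 1, find(old_prog, ' in'));
else new_prog = old_prog;
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
proc print data=have;
run;
Obs  old_prog                                                   new_prog
1    Master of Science in Building Performance and Diagnostics  Master of Science
2    Master of Science in Computational Design                  Master of Science
3    Master of Science in Sustainable Design                    Master of Science
4    Master of Urban Design                                     Master of Urban Design
5    PhD in Architecture                                        PhD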
Answer 1 (score: 0)
Here is a way to do it using substr and index.
data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;
if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3);
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
index finds the position of "in" in the string and passes it to substr, which returns the variable from that position + 3 through the end of the string.
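One caveat with searching for the bare "in": the first occurrence can fall inside another word. A minimal sketch, assuming you would rather match the standalone word, searches for " in " with surrounding blanks and skips 4 characters instead of 3. The first data row below is a hypothetical example that is not in the question, and the helper variable pos is illustrative only.

```sas
data want;
    length old_prog new_prog $200;
    infile datalines truncover;
    input old_prog $200.;
    /* Position of the first " in " (blank on each side); 0 if absent */
    pos = find(old_prog, " in ");
    /* Skip the 4 characters of " in " to land on the program name */
    if pos > 0 then new_prog = substr(old_prog, pos + 4);
    /* No " in " found: keep the original value */
    else new_prog = old_prog;
    drop pos;
datalines;
Master of Fine Arts in Design
Master of Urban Design
PhD in Architecture
;
run;
```

For the hypothetical first row, a search for the bare "in" would hit the "in" inside "Fine", so the extracted text would start too early. Searching for " in " avoids that.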
Answer 2 (score: 0)
Consider these data step and proc sql solutions, both using the substr and index functions:
data want;
set have;
if count(old_prog, " in ") ge 1
then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;
proc sql;
create table want as
select *,
case when index(old_prog, "in") > 0
then substr(old_prog, index(old_prog, "in")+3)
else old_prog
end as new_prog
from have;
quit;
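Answer 3 (score: 0)
You already have many valid options suggested by others. May I suggest a REGEX way to get what you want?
I noticed three patterns in your sample data.
REGEX is very useful when dealing with patterns in text, because you can define the text pattern you are looking for and extract the text when the pattern matches.
See the comments in the code for details: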
/* Dropping pattern ids because they are not useful in data */
data have (drop=pattern_in pattern_of);
/* Reading in the raw data from datalines */
input @1 old_prog $60.;
/* Compile the first pattern, based on "in ". */
pattern_in = prxparse('/in ([\w\s]*)/');
/* Compile the second pattern, based on "of ". */
pattern_of = prxparse('/of ([\w\s]*)/');
/* If the string matches the "in " pattern */
if prxmatch(pattern_in,old_prog) then
/* then extract the capture buffer after "in " */
new_prog=prxposn(pattern_in,1,old_prog);
/* Otherwise, if the string matches the "of " pattern */
else if prxmatch(pattern_of,old_prog) then
/* then extract the capture buffer after "of " */
new_prog=prxposn(pattern_of,1,old_prog);
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
PROC PRINT DATA=have;
run;
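The character class [\w\s] in the patterns above does not include a hyphen, so a value such as "Architecture-Engineering and Construction Management" (one of the rows in the question) would be cut at the hyphen. A possible variant, offered only as a sketch, folds both cases into one pattern: it prefers the standalone word "in", falls back to "of", and captures everything up to the last non-blank character. The variable name rx is illustrative only.

```sas
data want;
    length old_prog new_prog $200;
    infile datalines truncover;
    input old_prog $200.;
    /* Compile once and keep the pattern id across rows */
    retain rx;
    /* First alternative: text up to the word "in";          */
    /* second alternative (tried only if the first fails):   */
    /* text up to the word "of". (.*\S) drops trailing blanks */
    if _n_ = 1 then rx = prxparse('/^(?:.*?\bin\b|.*?\bof\b)\s+(.*\S)/');
    if prxmatch(rx, old_prog) then new_prog = prxposn(rx, 1, old_prog);
    /* Neither word present: keep the original value */
    else new_prog = old_prog;
    drop rx;
datalines;
Master of Science in Architecture-Engineering and Construction Management
Master of Urban Design
PhD in Architecture
;
run;
```

The word boundaries (\b) keep the pattern from matching "in" inside words such as "Engineering".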