I have a character variable in which each observation holds a bunch of free-form text. For example:
"SOLAR MARS T-1500S TURBINE (13,300-BHP, NG, CENTRIFUGAL)"
"13-A, C-B, 1350 HP, NATURAL GAS COMPRESSOR ENGINE"
"3,000HP KVT-512 ENGINE"
"Engine 1, Caterpillar G3512 TALE : Emission Point:"
"DRESSER RAND - 2SLB ENGINE 1 RATED AT 3,200 BHP"
"Clark Engine #1 - 1550 HP natural gas-fired SI 2SLB"
"E1 - s/n WPW-01669 JJJJ"
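I am trying to extract the horsepower value (any 4-5 digit number followed by "BHP" or "HP"). Note that some observations have no HP value at all. Ultimately, I would like to return the following: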
13,300-BHP
1350 HP
3,000HP
.
3,200 BHP
1550 HP
.
I have little experience using regex in SAS. Does anyone have any ideas on how to accomplish this?
Thanks
Answer 0 (score: 2)
Converting @Zachary Haber's comment into code.
First, convert the text into an actual SAS dataset.
data have;
  input string $80.;
cards;
SOLAR MARS T-1500S TURBINE (13,300-BHP, NG, CENTRIFUGAL)
13-A, C-B, 1350 HP, NATURAL GAS COMPRESSOR ENGINE
3,000HP KVT-512 ENGINE
Engine 1, Caterpillar G3512 TALE : Emission Point:
DRESSER RAND - 2SLB ENGINE 1 RATED AT 3,200 BHP
Clark Engine #1 - 1550 HP natural gas-fired SI 2SLB
E1 - s/n WPW-01669 JJJJ
;
Now read that dataset and use CALL PRXNEXT() to find the first match. I also added code to convert the result to a number.
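data want;
  set have;
  /* Compile the pattern once, on the first row, and keep its id across rows */
  if _n_=1 then regexid = prxparse('(\d+(,\d+)*[ -]*?B?HP)');
  retain regexid;
  drop regexid;
  length want $40;
  /* Search the whole string; PRXNEXT sets position to 0 when nothing matches */
  start=1;
  stop=length(string);
  call prxnext(regexid,start,stop,string,position,len);
  want=substrn(string,position,len);
  /* Strip commas, hyphens, blanks and the letters B, H, P, then read as a number */
  HP = input(compress(want,',- BHP'),??32.);
run;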
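As a possible variation, the numeric part can be pulled out with a capture group instead of stripping characters afterwards. The sketch below is an assumption-laden alternative, not the answer's method: the dataset name `want2`, the variables `rx` and `hp_text`, and the tighter pattern (1-2 digits, an optional comma, 3 digits, at most one blank or hyphen, word boundaries on both sides) are all made up here to mirror the "4-5 digit" wording of the question. It uses PRXMATCH to test for a match and PRXPOSN to read capture buffer 1.

```sas
data want2;
  set have;
  /* Hypothetical stricter pattern: 4-5 digits with an optional thousands comma */
  retain rx;
  if _n_ = 1 then rx = prxparse('/\b(\d{1,2},?\d{3})[ -]?B?HP\b/');
  drop rx;
  length hp_text $40;
  /* PRXMATCH returns 0 when there is no match, so hp_text and hp stay missing */
  if prxmatch(rx, string) then do;
    hp_text = prxposn(rx, 1, string);            /* digits and comma only */
    hp = input(compress(hp_text, ','), ?? 32.);  /* drop the comma, read as number */
  end;
run;
```

On the seven sample rows this should pick out the same numbers as the PRXNEXT version, but it would reject values such as a 3-digit rating or a number separated from "HP" by several blanks, so whether the stricter pattern is appropriate depends on the real data.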
Result:
Obs    want          HP       string
 1     13,300-BHP    13300    SOLAR MARS T-1500S TURBINE (13,300-BHP, NG, CENTRIFUGAL)
 2     1350 HP        1350    13-A, C-B, 1350 HP, NATURAL GAS COMPRESSOR ENGINE
 3     3,000HP        3000    3,000HP KVT-512 ENGINE
 4                       .    Engine 1, Caterpillar G3512 TALE : Emission Point:
 5     3,200 BHP      3200    DRESSER RAND - 2SLB ENGINE 1 RATED AT 3,200 BHP
 6     1550 HP        1550    Clark Engine #1 - 1550 HP natural gas-fired SI 2SLB
 7                       .    E1 - s/n WPW-01669 JJJJ
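The listing has one match per row because only the first hit is kept. If a description could contain more than one rating, CALL PRXNEXT can be called repeatedly, since it advances `start` past each match. The following is a minimal sketch under that assumption; the dataset name `all_matches` and the variables `rx`, `match` and `match_no` are hypothetical, and the pattern is the answer's pattern written with `/` delimiters.

```sas
data all_matches;
  set have;
  retain rx;
  if _n_ = 1 then rx = prxparse('/\d+(,\d+)*[ -]*?B?HP/');
  drop rx start stop position len;
  length match $40;
  start = 1;
  stop  = length(string);
  /* First search; position is 0 when the row has no match at all */
  call prxnext(rx, start, stop, string, position, len);
  do match_no = 1 by 1 while (position > 0);
    match = substrn(string, position, len);
    output;  /* one output row per match */
    /* PRXNEXT has already moved start past the previous match */
    call prxnext(rx, start, stop, string, position, len);
  end;
run;
```

Rows without any match produce no output here, so this layout suits a long "one row per rating" table rather than the one-row-per-observation result shown above.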