SAS RegEx string extraction

Date: 2020-04-09 16:16:52

Tags: regex sas extract

I have a vector of strings in which every observation contains a bunch of free-form text. For example:

  "SOLAR MARS T-1500S TURBINE (13,300-BHP, NG, CENTRIFUGAL)"
  "13-A, C-B, 1350 HP, NATURAL GAS COMPRESSOR ENGINE"
  "3,000HP KVT-512 ENGINE"
  "Engine 1, Caterpillar G3512 TALE : Emission Point:"
  "DRESSER RAND - 2SLB ENGINE 1 RATED AT 3,200 BHP"
  "Clark Engine #1 - 1550 HP natural gas-fired SI 2SLB"
  "E1 - s/n WPW-01669 JJJJ"

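I am trying to extract the horsepower value (any 4-5 digit number followed by "BHP" or "HP"). Note that some observations have no HP value at all. Ultimately, I would like to return the following: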

  13,300-BHP
  1350 HP
  3,000HP
  .
  3,200 BHP
  1550 HP
  .

I don't have much experience with regex in SAS. Does anyone have an idea of how to achieve this?

Thanks

1 answer:

Answer 0 (score: 2)

Converting @Zachary Haber's comment into code.

First, turn the text into an actual SAS dataset.

data have;
  input string $80.;
cards;
SOLAR MARS T-1500S TURBINE (13,300-BHP, NG, CENTRIFUGAL)
13-A, C-B, 1350 HP, NATURAL GAS COMPRESSOR ENGINE
3,000HP KVT-512 ENGINE
Engine 1, Caterpillar G3512 TALE : Emission Point:
DRESSER RAND - 2SLB ENGINE 1 RATED AT 3,200 BHP
Clark Engine #1 - 1550 HP natural gas-fired SI 2SLB
E1 - s/n WPW-01669 JJJJ
;

Now read that dataset and use CALL PRXNEXT() to find the first match. I also added code to convert the result to a number.

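data want;
  set have;
  /* Compile the pattern once, on the first observation, and keep its id */
  if _n_=1 then regexid = prxparse('(\d+(,\d+)*[ -]*?B?HP)');
  retain regexid;
  drop regexid;
  length want $40;
  /* Search from the first character to the last non-blank one */
  start=1;
  stop=length(string);
  /* position and len are both 0 when nothing matches, so want stays blank */
  call prxnext(regexid,start,stop,string,position,len);
  want=substrn(string,position,len);
  /* Strip commas, hyphens, blanks and the letters B, H, P, then read as a number */
  HP = input(compress(want,',- BHP'),??32.);
run;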

Result:

Obs       want         HP     string

 1     13,300-BHP    13300    SOLAR MARS T-1500S TURBINE (13,300-BHP, NG, CENTRIFUGAL)
 2     1350 HP        1350    13-A, C-B, 1350 HP, NATURAL GAS COMPRESSOR ENGINE
 3     3,000HP        3000    3,000HP KVT-512 ENGINE
 4                       .    Engine 1, Caterpillar G3512 TALE : Emission Point:
 5     3,200 BHP      3200    DRESSER RAND - 2SLB ENGINE 1 RATED AT 3,200 BHP
 6     1550 HP        1550    Clark Engine #1 - 1550 HP natural gas-fired SI 2SLB
 7                       .    E1 - s/n WPW-01669 JJJJ
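
If only the numeric value is needed, a variation on the step above is to put the digits in their own capture group and read them back with PRXPOSN(), so the unit text never has to be stripped out afterwards. This is just a minimal sketch, not part of the original answer: the dataset name want2 and the variables rx and digits are made up for illustration, and the pattern is the same one as above, only wrapped in / delimiters and regrouped.

```sas
data want2;
  set have;
  /* Compile once. Capture group 1 holds only the digits and commas; */
  /* the optional separator and the B?HP unit sit outside the group. */
  if _n_ = 1 then rx = prxparse('/(\d+(,\d+)*)[ -]*B?HP/');
  retain rx;
  drop rx;
  length digits $40;
  /* prxmatch returns the position of the first match, or 0 if there is none */
  if prxmatch(rx, string) then do;
    digits = prxposn(rx, 1, string);          /* text of capture group 1 */
    HP = input(compress(digits, ','), ?? 32.); /* drop commas, read as a number */
  end;
run;
```

On observations with no match, digits stays blank and HP stays missing, as in the PRXNEXT version. For the sample data I would expect the same HP values as in the table above, but I have not run this against anything beyond those seven rows.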