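SAS Perl regular expressions: how do I write the correct syntax?

Asked: 2014-04-18 15:53:29

Tags: regex sas

I have some complicated string parsing that is hard to do with the regular SAS functions because the string values are inconsistent, so I think I need to use Perl regular expressions. There are 4 variables below (price, date, size, bundle) that I have to create from parts of a text string. I can't get the syntax right; I'm new to regular expressions.

Here is a sample data set.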

data have;
infile cards truncover;
  input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;run;

/* The first variable is price; it is usually near the end or the middle of the string */

data want;
set have;
  price =(input(prxchange('s/(\w+)_(\d+)_(\w+)/$2/',-1,text),8.))/100;
  format price dollar8.2;
run;

Using the data set above I need to have this result:

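Price 0 79.99 89.99 89.99 79.99 64.99

/* The date is always a run of consecutive digits: 6, 7, or 8 of them. Since | means 'or', I thought I would be able to pull it out like this */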

data want;
set have;
date=prxparse('/\d\d\d\d\d\d|\d\d\d\d\d\d\d|\d\d\d\d\d\d\d\d/',text);
run;

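Using the data above I need to get this result:

Date 1192014 112014 2102014 272014 12252014 462014 1192014 12162013

/* For size, there is always an 'x' in the middle of the substring, followed by two or three digits */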

data want;
set have;
  size=prxparse('/(\w+)_(\d+)'x'(\d+)_(\w+)/',text);
run;

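Size 728x90 160x600 300x250 160x600 728x90

/* Bundle is usually toward the beginning of the string. It is always a digit followed by an x, never followed by an additional digit, but it can also be just 0. */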

data want;
set have;
  Bundle=prxparse('/(\d+)'x'',text);
run;

Bundle 0 3x 3x 3X 3x 0 2x 3x

The final product I am looking for should look like this:

Text                                                        Date      price   Size     Bundle
acq_newsale_0_CartChat_0_Flash_1192014.jpg                  1192014   0                0
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf              112014    79.99            3x
acq_sale_3xconoffer_8999_nacpg_2102014.sfw                  2102014   89.99            3x
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp        272014    89.99   728x90   3X
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov           12252014          160x600  3x
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg     462014            300x250  0
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf  1192014   79.99   160x600  2x
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf       12162013  64.99   728x90   3x

1 Answer:

Answer 0 (score: 3)

If you are extracting, don't use PRXCHANGE. Use PRXPARSE, PRXMATCH, and PRXPOSN.

Sample usage, for the date:

data have;
infile cards truncover;
  input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
;
run;

data want;
set have;
rx_date = prxparse('~(\d{6,8})~io');
rc_date = prxmatch(rx_date,text);
if rc_date then datevar = prxposn(rx_date,1,text);
run;

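Just put parentheses around the part you want to extract (in this case, all of it).

The date is easy: as you say, 6 to 8 digits. The others may be harder. The 3x bit you can probably find, depending on how strict you need to be; the price, I think, will be hard to find. You need to be able to articulate the rules better. "Toward the beginning" is not a regex rule. "The second group of digits" is; "the next-to-last group" might also be workable. I'll see if I can work some out.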
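A rule such as "the second group of digits" can be expressed by stepping through the matches one at a time. The sketch below is only an illustration, not part of the original answer. It assumes that a "group of digits" means an all-digit token that follows an underscore and is followed by an underscore or a period; the data set name second_number and the variable second_num are hypothetical. It uses CALL PRXNEXT to walk the matches and keeps the second one.

```sas
data second_number;
  set have;
  length second_num $8;   /* hypothetical output variable */
  /* underscore, digits, then a lookahead for _ or . so that
     adjacent tokens such as _0_0_ are both found */
  rx = prxparse('/_\d+(?=[_.])/');
  start = 1;
  stop  = length(text);
  n     = 0;
  call prxnext(rx, start, stop, text, pos, len);
  do while (pos > 0);
    n = n + 1;
    if n = 2 then do;
      /* skip the leading underscore that is part of the match */
      second_num = substr(text, pos + 1, len - 1);
      leave;
    end;
    call prxnext(rx, start, stop, text, pos, len);
  end;
  drop rx start stop n pos len;
run;
```

Rows with fewer than two such tokens leave second_num blank. Whether the second token is actually the price depends on the naming rules, which is why those rules need to be pinned down first.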

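The following works on your example data. I particularly don't like the price search; it could well fail on a more complicated set of data. You can work out adding the decimal point yourself.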

data have;
infile cards truncover;
  input text $80.;
cards;
acq_newsale_0_CartChat_0_Flash_1192014.jpg
acq_old_3x_GadgetPotomac_7999_Flash_112014.swf
acq_sale_3xconoffer_8999_nacpg_2102014.sfw
acq_is_3X_ItsEasy_8999_NACPG_Flash_272014_728x90.hgp
awa_os_3xMZ1_FiOSPresents_FF_160x600_12252014.mov
awa_fs_0_TWCMLP_v2_switch_0_0_Static_462014_300x250.jpg
acq_fi_2x_incrediblemz1_7999_nac_flash_1192014_160x600.swf
acq_fio_3x_bringhome_6499_0_flash_12162013_728x90.swf
blahblah :23 blahblah
blahblahblah 23 blah blah
;
run;

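data want;
set have;
/* Compile each pattern; i ignores case, o compiles the pattern only once */
/* date: 6 to 8 digits after an underscore, ending at _ or . */
rx_date   = prxparse('~_(\d{6,8})[_\.]~io');
/* price: first all-digit token between underscores that has another _digits token later in the string */
rx_price  = prxparse('~_(\d+)_.*?(?=_\d+[_\.]).*?(?!_\d+[_\.])~io');
/* bundle: a single digit followed by x, right after an underscore */
rx_bundle = prxparse('~(?!_\d+_)_(\dx)~io');
/* size: digits, x, digits after an underscore, ending at _ or . */
rx_size   = prxparse('~_(\d+x\d+)[_\.]~io');
/* adnum: two digits between whitespace, optionally preceded by a colon */
rx_adnum  = prxparse('~\s:?(\d\d)\s~io');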

rc_date   = prxmatch(rx_date,text);
rc_price  = prxmatch(rx_price,text);
rc_bundle = prxmatch(rx_bundle,text);
rc_size   = prxmatch(rx_size,text);
rc_adnum  = prxmatch(rx_adnum,text);

if rc_date   then datevar = prxposn(rx_date,1,text);
if rc_price  then price = prxposn(rx_price,1,text);
if rc_bundle then bundle = prxposn(rx_bundle,1,text);
if rc_size   then size   = prxposn(rx_size,1,text);
if rc_adnum  then adnum  = prxposn(rx_adnum,1,text);

run;
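
For the decimal point, one possible follow-up step is sketched below. It is not from the original answer and rests on one assumption: that the captured price is always the amount in cents, as in the sample data. PRXPOSN returns character values, so the text is read with INPUT and then divided by 100; the data set want_numeric and the variable price_usd are hypothetical names.

```sas
data want_numeric;
  set want;
  /* price is text such as '7999'; read it as a number and scale it to dollars */
  if not missing(price) then price_usd = input(price, 8.) / 100;
  format price_usd dollar8.2;
  /* keep only the extracted fields, not the regex ids and return codes */
  keep text datevar price_usd bundle size adnum;
run;
```

Rows where the price pattern did not match keep a missing price_usd rather than a zero.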