Need to extract substrings based on a keyword

Date: 2015-08-14 16:36:56

Tags: ruby regex string parsing

I have a string (a chunk of CDATA from SOAP) that looks roughly like this:

     "<![CDATA[XXX|^~\&
      KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx 
      INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx 
      INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx              
      KEY|^~\&|xxxxxx|xxxxxxxxxx|xxxxxxxx    
      INFO||xx|xxxxxxxx||xxxxxxx|xxxxxx 
      INFO|||xxxx|x|||xxxxxxxxx|||||||x|||xxxxx|||xxxx||||||||||||||||||||||||xxxx
      KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx 
      INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx 
      INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx  ]]>"

I'm wondering how to safely parse out each 'KEY' section of this string using Ruby. Basically, I need a string that looks like this:

  "KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx 
  INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx 
  INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx"

I need one such string for each occurrence of 'KEY'. Any thoughts on the best approach? Thanks.

2 Answers:

Answer 0: (score: 0)

Here is one way to do it (using a simplified example):

str = 
"<![CDATA[XXX|^~\&
KEY|^~\&|x
INFO||x
INFO|||x
KEY|^~\&|x
INFO||xx|x
INFO|||x
KEY|^~\&|x
INFO||x
INFO|||x"

r = /
    ^KEY\b         # match KEY at beginning of line followed by word boundary
    .+?            # match one or more characters, newlines included, lazily
    (?=\bKEY\b|\z) # match KEY bracketed by word boundaries or end of
                   # string, in positive lookahead
    /mx            # multiline and extended modes

str.scan r
  #=> ["KEY|^~&|x\nINFO||x\nINFO|||x\n",
  #    "KEY|^~&|x\nINFO||xx|x\nINFO|||x\n",
  #    "KEY|^~&|x\nINFO||x\nINFO|||x"] 
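
Note that the simplified example has no leading whitespace, whereas the lines in the question are indented and the last one ends with `]]>`. As a rough sketch of a line-based alternative (not a replacement for the regex above), the text can be split into lines and grouped with `Enumerable#slice_before`. The helper name `key_sections` and the sample data are made up for illustration, and the sketch assumes every section starts on a line whose first field is exactly `KEY`.

```ruby
# Hypothetical helper (not from the original answer): returns one string per
# KEY section, with indentation and the CDATA wrapper removed.
def key_sections(text)
  # Drop the "<![CDATA[" prefix and the "]]>" suffix if they are present.
  body = text.sub(/\A\s*<!\[CDATA\[/, "").sub(/\]\]>\s*\z/, "")

  body.each_line
      .map(&:strip)                                        # remove indentation and trailing spaces
      .reject(&:empty?)                                    # ignore blank lines
      .slice_before { |line| line.start_with?("KEY|") }    # start a new group at each KEY line
      .select { |chunk| chunk.first.start_with?("KEY|") }  # discard anything before the first KEY
      .map { |chunk| chunk.join("\n") }
end

# Made-up sample shaped like the question's data; the quoted heredoc keeps
# the backslashes literal.
sample = <<~'TEXT'
  <![CDATA[XXX|^~\&
        KEY|^~\&|a1|a2
        INFO||a3
        INFO|||a4
        KEY|^~\&|b1
        INFO||b2
        KEY|^~\&|c1
        INFO|||c2  ]]>
TEXT

key_sections(sample).each { |section| puts section, "--" }
```

Because the grouping works on stripped lines, it does not depend on how far the data is indented, and a `KEY` that happens to appear inside a field does not start a new section.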

Answer 1: (score: 0)

I'm not that fluent with regular expressions, but this might work for you:

KEY(.+\n)+(?=\s+KEY)
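
One thing to watch with the pattern above, if I read it correctly: the `+` is greedy and the lookahead requires another `KEY` afterwards. On input like the question's it can therefore merge neighbouring sections, and it will not return the final one. A possible variant is sketched below; it is my assumption, not the answerer's code. It matches a `KEY` line and then every following line that does not itself begin with `KEY`. The names `KEY_SECTION` and `scan_key_sections` and the sample text are made up for illustration.

```ruby
# Variant pattern (an assumption, not the answerer's code): a line whose first
# token is KEY, then each following line that does not start a new KEY section.
KEY_SECTION = /^[ \t]*KEY\b.*(?:\n(?![ \t]*KEY\b).*)*/

# Hypothetical helper: scan for sections and tidy up the indentation.
def scan_key_sections(text)
  text.scan(KEY_SECTION).map do |section|
    section.lines.map(&:strip).reject(&:empty?).join("\n")
  end
end

# Made-up, indented sample; the quoted heredoc keeps the backslashes literal.
text = <<~'TEXT'
  HEADER|^~\&
      KEY|^~\&|a
      INFO||b
      KEY|^~\&|c
      INFO||d
      INFO||e
      KEY|^~\&|f
TEXT

scan_key_sections(text).each { |section| puts section, "--" }
```

Since the pattern contains only a non-capturing group, `String#scan` returns the whole matches rather than the contents of a capture group. A closing `]]>` on the last line would still be part of the final match, so it would need to be stripped separately.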