Split a string by tags using RegEx

Time: 2017-09-29 07:20:28

Tags: regex

I need help splitting the following multi-tag string on tags such as <eyn>, <un>, and <an>:
Your colleague <eyn id='test@test.com'>user</eyn> is now communicating with <un id='test@test.com'>user</un> from <an id='4442729'>test, Inc.</an>

1 answer:

Answer 0 (score: 0)

Using Regex

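Parsing HTML with a regular expression is ill-advised because of all the obscure edge cases that can crop up, but it seems you control the HTML, so you should be able to avoid many of the edge cases the regex police cry about.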

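Proposed solution

I would probably want to collect the entire tag, the ID value, and the raw text between the opening and closing tags in a single operation.

This regular expression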

<(eyn|un|an)\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\bid=('[^']*'|"[^"]*"|[^'"\s>]*))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?\/?>(.*?)<\/\w+>


will do the following:

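  • Finds all eyn, un, and an tags
  • Requires the tag to have an id attribute
  • Allows the id attribute value to be unquoted or surrounded by ' or "
  • Avoids the difficult edge cases that make pattern matching in HTML hard
  • Creates the following capture groups
    • Group 0: the entire tag from opening to closing
    • Group 1: the tag name
    • Group 2: the id value
    • Group 3: the raw text between the opening and closing tags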
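
As a rough illustration of how these capture groups could be consumed, here is a minimal sketch in Python. The answer does not name a language, so the use of Python's `re` module is an assumption, and the names `TAG_PATTERN` and `extract_tags` are hypothetical. Removing the surrounding quotes from group 2 is an extra convenience step that is not part of the original pattern.

```python
import re

# The pattern from the answer, unchanged. A raw triple-quoted string is used
# because the pattern contains both ' and " characters.
TAG_PATTERN = re.compile(
    r'''<(eyn|un|an)\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\bid=('[^']*'|"[^"]*"|[^'"\s>]*))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?\/?>(.*?)<\/\w+>'''
)


def extract_tags(text):
    """Return a list of (tag_name, id_value, inner_text) tuples (hypothetical helper)."""
    results = []
    for m in TAG_PATTERN.finditer(text):
        tag, raw_id, inner = m.group(1), m.group(2), m.group(3)
        # Group 2 keeps the quotes when the value is quoted; drop one matching pair.
        if len(raw_id) >= 2 and raw_id[0] == raw_id[-1] and raw_id[0] in "'\"":
            raw_id = raw_id[1:-1]
        results.append((tag, raw_id, inner))
    return results


sample = (
    "Your colleague <eyn id='test@test.com'>user</eyn> is now communicating "
    "with <un id='test@test.com'>user</un> from <an id='4442729'>test, Inc.</an>"
)

for tag, id_value, inner in extract_tags(sample):
    print(tag, id_value, inner)
```

Note that `.` does not match a newline by default in Python, so text between the opening and closing tags is expected to sit on a single line, which agrees with the explanation of `.*?` given for the pattern.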

Example

See also Live demo

Sample text

Note the difficult edge case nested in the second text block.

Your colleague <eyn id='test@test.com'>user</eyn> is now communicating with <un id='test@test.com'>user</un> from <an id='4442729'>test, Inc.</an>

Your colleague <eyn onmouseover=' if ( 3 > a ) { var 
string=" <eyn id=NotTheDroidYouAreLookingFor>R2D2</eyn>; "; } '
  id='DesiredDroids'>This is the droid I'm looking for</eyn> is now communicating with <un id="test@test.com">user</un> from <an id=4442729>test, Inc.</an>

Sample matches

Match 1
Full match  15-49   `<eyn id='test@test.com'>user</eyn>`
Group 1.    16-19   `eyn`
Group 2.    23-38   `'test@test.com'`
Group 3.    39-43   `user`

Match 2
Full match  76-108  `<un id='test@test.com'>user</un>`
Group 1.    77-79   `un`
Group 2.    83-98   `'test@test.com'`
Group 3.    99-103  `user`

Match 3
Full match  114-146 `<an id='4442729'>test, Inc.</an>`
Group 1.    115-117 `an`
Group 2.    121-130 `'4442729'`
Group 3.    131-141 `test, Inc.`

Match 4
Full match  163-326 `<eyn onmouseover=' if ( 3 > a ) { var 
string=" <eyn id=NotTheDroidYouAreLookingFor>R2D2</eyn>; "; } '
  id='DesiredDroids'>This is the droid I'm looking for</eyn>`
Group 1.    164-167 `eyn`
Group 2.    271-286 `'DesiredDroids'`
Group 3.    287-320 `This is the droid I'm looking for`

Match 5
Full match  353-385 `<un id="test@test.com">user</un>`
Group 1.    354-356 `un`
Group 2.    360-375 `"test@test.com"`
Group 3.    376-380 `user`

Match 6
Full match  391-421 `<an id=4442729>test, Inc.</an>`
Group 1.    392-394 `an`
Group 2.    398-405 `4442729`
Group 3.    406-416 `test, Inc.`
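
Because every match carries start and end offsets like the ones listed above, the string could also be split into alternating plain-text and tag segments, which is closer to what the question asks for. The following is a sketch only, again assuming Python's `re` module; `TAG_PATTERN` and `split_by_tags` are hypothetical names, and the segment tuple layout is an arbitrary choice made for illustration.

```python
import re

# Same pattern as in the answer, written as a raw triple-quoted string.
TAG_PATTERN = re.compile(
    r'''<(eyn|un|an)\b(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\bid=('[^']*'|"[^"]*"|[^'"\s>]*))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*\s?\/?>(.*?)<\/\w+>'''
)


def split_by_tags(text):
    """Split text into ("text", str) and ("tag", name, id, inner) segments, in order."""
    segments = []
    pos = 0  # end offset of the previous match
    for m in TAG_PATTERN.finditer(text):
        if m.start() > pos:
            # Plain text between the previous match and this one.
            segments.append(("text", text[pos:m.start()]))
        # Group 2 is kept as captured, so quoted ids still include their quotes.
        segments.append(("tag", m.group(1), m.group(2), m.group(3)))
        pos = m.end()
    if pos < len(text):
        # Trailing plain text after the last tag.
        segments.append(("text", text[pos:]))
    return segments


line = (
    "Your colleague <eyn id='test@test.com'>user</eyn> is now communicating "
    "with <un id='test@test.com'>user</un> from <an id='4442729'>test, Inc.</an>"
)

for segment in split_by_tags(line):
    print(segment)
```

For the first sample line, the span of the first match found this way is 15-49, which agrees with Match 1 above.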

Explanation

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    eyn                      'eyn'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    un                       'un'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    an                       'an'
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
--------------------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      ='                       '=\''
--------------------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
      '                        '\''
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      ="                       '="'
--------------------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
      "                        '"'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      =                        '='
--------------------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
--------------------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*?                      end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    id=                      'id='
--------------------------------------------------------------------------------
    (                        group and capture to \2:
--------------------------------------------------------------------------------
      '                        '\''
--------------------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
      '                        '\''
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      "                        '"'
--------------------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
      "                        '"'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      [^'"\s>]*                any character except: ''', '"',
                               whitespace (\n, \r, \t, \f, and " "),
                               '>' (0 or more times (matching the
                               most amount possible))
--------------------------------------------------------------------------------
    )                        end of \2
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    ='                       '=\''
--------------------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    '                        '\''
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    ="                       '="'
--------------------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    "                        '"'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    =                        '='
--------------------------------------------------------------------------------
    [^'"\s]*                 any character except: ''', '"',
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \/?                      '/' (optional (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  >                        '>'
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
--------------------------------------------------------------------------------
  )                        end of \3
--------------------------------------------------------------------------------
  <                        '<'
--------------------------------------------------------------------------------
  \/                       '/'
--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  >                        '>'