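Regex to find multiple matches between two strings

Date: 2016-04-30 19:33:59

Tags: c# regex string

Here is a link to the string I'm trying to work with: http://pastebin.com/raw/TRKbqGxs

Yes, I know regex isn't the best way to parse HTML, but I want to use it for this project. I don't want to use HTML Agility Pack right now.

The lines I'm mainly interested in are these: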

data-screen-name="thanhbach195" data-name="Mai Thanh Bách" data-protected="false">

data-screen-name="zeref980" data-name="Yan Naung Htet" data-protected="false">

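I want to extract the data that sits between data-screen-name=" and " data-name=.

In this case, that is thanhbach195 and zeref980.

I tried the following regex: string reg = "data-screen-name=\"(.*)\" data-name=\"";

But for some reason I don't get multiple matches. In fact, it doesn't seem to extract the strings I want at all.

I'd appreciate it if someone could help me write a correct regex that extracts every match between those two strings in the text I linked above.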

private List<string> getUsers(string str)
        {
            List<string> users = new List<string>();
            string reg = "data-screen-name=\"(.*)\" data-name=\"";
            MatchCollection mc = Regex.Matches(str, reg);
            foreach(Match m in mc)
            {
                users.Add(m.Groups[1].Value);
            }
            return users;
        }

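This code returns the same match every time (I believe it's the first one).

2 Answers:

Answer 0: (score: 0)

Try a regex with named groups whose captures stop at the next double quote instead of matching greedily: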

string reg = "data-screen-name=\"(?<dataScreenName>[^\"]+)\" data-name=\"(?<dataName>[^\"]+)\"";

This lets you use:

foreach(Match m in mc)
{
    users.Add(m.Groups["dataScreenName"].Value);
}
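
Dropped into the asker's method, the corrected pattern might look like the sketch below. The class name `UserExtractor` and the inline sample input are made up for illustration; only the pattern comes from this answer. The sample joins two tags on one line, which is a case where the greedy `(.*)` would run to the last `" data-name="` on the line and return a single oversized match, while `[^"]+` cannot cross a closing quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UserExtractor
{
    // Negated character classes stop each capture at the next double quote.
    private const string Pattern =
        "data-screen-name=\"(?<dataScreenName>[^\"]+)\" data-name=\"(?<dataName>[^\"]+)\"";

    public static List<string> GetUsers(string html)
    {
        var users = new List<string>();
        foreach (Match m in Regex.Matches(html, Pattern))
        {
            users.Add(m.Groups["dataScreenName"].Value);
        }
        return users;
    }

    public static void Main()
    {
        // Hypothetical input: two lines from the question joined into one.
        string sample =
            "data-screen-name=\"thanhbach195\" data-name=\"Mai Thanh Bách\" data-protected=\"false\">" +
            "data-screen-name=\"zeref980\" data-name=\"Yan Naung Htet\" data-protected=\"false\">";

        foreach (string user in GetUsers(sample))
        {
            Console.WriteLine(user); // thanhbach195, then zeref980
        }
    }
}
```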

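Answer 1: (score: 0)

Disclaimer

You really should use an HTML parsing engine, because there are many obscure edge cases that a regex can't accommodate. But I'm not your mom, so I won't tell you how to live your life.

Since you seem to have control over the source text, you may be able to avoid the obscure edge cases entirely.

Description

This regex will do the following:

  • Find all div tags
  • Ensure each div tag found contains class="user-actions, whether or not the value is quoted
  • Capture the values of data-screen-name, data-name, and data-protected into their own capture groups
  • Allow the attribute/value pairs to appear in any order
  • Allow the quotes around values to be optional, so you can use single quotes, double quotes, or no quotes
  • Strip the quotes from the values so you get the raw value
  • Avoid many of the messy edge cases the regex police cite when matching HTML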
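
The any-order behaviour comes from putting each attribute in its own lookahead anchored right after `<div`. The sketch below shows the idea with a deliberately simplified pattern, not the full regex from this answer: it handles double-quoted values only and assumes no `>` inside a value. The class name `LookaheadDemo` and the sample tags are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Each lookahead scans forward from just after "<div" for its own
    // attribute, so the attributes may appear in either order in the tag.
    private static readonly Regex Tag = new Regex(
        @"<div\b(?=[^>]*\sdata-screen-name=""([^""]*)"")(?=[^>]*\sdata-name=""([^""]*)"")[^>]*>",
        RegexOptions.IgnoreCase);

    public static string Describe(string html)
    {
        Match m = Tag.Match(html);
        return m.Success ? m.Groups[1].Value + "|" + m.Groups[2].Value : "(no match)";
    }

    public static void Main()
    {
        // Same attributes in both orders: both print w33haa|Aliwi Omar
        Console.WriteLine(Describe("<div data-screen-name=\"w33haa\" data-name=\"Aliwi Omar\">"));
        Console.WriteLine(Describe("<div data-name=\"Aliwi Omar\" data-screen-name=\"w33haa\">"));
        // A required attribute is missing: prints (no match)
        Console.WriteLine(Describe("<div data-name=\"Aliwi Omar\">"));
    }
}
```

.NET keeps captures made inside a positive lookahead, which is why groups 1 and 2 are still readable after the match.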

Regex

<div\b(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?class=['"]?user-actions)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-screen-name=(['"]?)(.*?)\1(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-name=(['"]?)(.*?)\3(?:\s|>))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?data-protected=(['"]?)(.*?)\5(?:\s|>))(?:[^>=]|='(?:[^']|\\')*'|="(?:[^"]|\\")*"|=[^'"][^\s>]*)*\s*>

I recommend using the case-insensitive flag.

Example

Source text excerpt

A small portion of the sample text:

<div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">

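Live example

This shows a large portion of the file rather than all of it, because the online tool bogs down on the huge string of text.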

https://regex101.com/r/bY1kH8/1

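Capture groups

  • Group 0 gets the entire opening div tag
  • Group 1 gets the quote around the data-screen-name value, if there is one
  • Group 2 gets the data-screen-name value, not including any quotes
  • Group 3 gets the quote around the data-name value, if there is one
  • Group 4 gets the data-name value, not including any quotes
  • Group 5 gets the quote around the data-protected value, if there is one
  • Group 6 gets the data-protected value, not including any quotes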
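
In C#, reading those groups could look roughly like this. It is a sketch: the class name `UserActionsParser` and the `Skip` constant are introduced here only to keep the pattern readable, with `Skip` being the attribute-skipping prefix that the regex repeats in every lookahead. The concatenated string is meant to equal the regex above, written as verbatim strings with doubled quotes, and the input is the excerpt from the source text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UserActionsParser
{
    // Lazily skips over attribute text; repeated at the start of each lookahead.
    private const string Skip = @"(?:[^>=]|='[^']*'|=""[^""]*""|=[^'""][^\s>]*)*?";

    public static readonly Regex Tag = new Regex(
        @"<div\b" +
        "(?=" + Skip + @"class=['""]?user-actions)" +
        "(?=" + Skip + @"data-screen-name=(['""]?)(.*?)\1(?:\s|>))" +
        "(?=" + Skip + @"data-name=(['""]?)(.*?)\3(?:\s|>))" +
        "(?=" + Skip + @"data-protected=(['""]?)(.*?)\5(?:\s|>))" +
        @"(?:[^>=]|='(?:[^']|\\')*'|=""(?:[^""]|\\"")*""|=[^'""][^\s>]*)*\s*>",
        RegexOptions.IgnoreCase); // the case-insensitive flag recommended above

    public static void Main()
    {
        string html =
            "<div class=\"user-actions btn-group not-following not-muting protected\" data-user-id=\"726459723365502976\"\n" +
            "data-screen-name=\"Just__Kidding__\" data-name=\"Chaw Chin Fong\" data-protected=\"true\">";

        foreach (Match m in Tag.Matches(html))
        {
            // Groups 2, 4 and 6 hold the values; 1, 3 and 5 hold the quote characters.
            Console.WriteLine("{0} | {1} | {2}",
                m.Groups[2].Value, m.Groups[4].Value, m.Groups[6].Value);
        }
    }
}
```

For the asker's purpose, collecting `m.Groups[2].Value` from each match would be enough to build the list of screen names.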

Sample matches

These were taken from the source text using the suggested regex.

[0][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="2582252852"
        data-screen-name="w33haa" data-name="Aliwi Omar" data-protected="false">
[0][1] = "
[0][2] = w33haa
[0][3] = "
[0][4] = Aliwi Omar
[0][5] = "
[0][6] = false

[1][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1680222842"
        data-screen-name="Jamjomon" data-name="Jamchu :3" data-protected="false">
[1][1] = "
[1][2] = Jamjomon
[1][3] = "
[1][4] = Jamchu :3
[1][5] = "
[1][6] = false

[2][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1523823648"
        data-screen-name="dimakoza4enko" data-name="Дима Козаченко" data-protected="false">
[2][1] = "
[2][2] = dimakoza4enko
[2][3] = "
[2][4] = Дима Козаченко
[2][5] = "
[2][6] = false

[3][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="1522238240"
        data-screen-name="alupulipulipala" data-name="Wahid Arefin" data-protected="false">
[3][1] = "
[3][2] = alupulipulipala
[3][3] = "
[3][4] = Wahid Arefin
[3][5] = "
[3][6] = false

[4][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="4804204573"
        data-screen-name="thanhbach195" data-name="Mai Thanh Bách" data-protected="false">
[4][1] = "
[4][2] = thanhbach195
[4][3] = "
[4][4] = Mai Thanh Bách
[4][5] = "
[4][6] = false

[5][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726465523223908353"
        data-screen-name="zeref980" data-name="Yan Naung Htet" data-protected="false">
[5][1] = "
[5][2] = zeref980
[5][3] = "
[5][4] = Yan Naung Htet
[5][5] = "
[5][6] = false

[6][0] = <div class="user-actions btn-group not-following not-muting protected" data-user-id="726459723365502976"
        data-screen-name="Just__Kidding__" data-name="Chaw Chin Fong" data-protected="true">
[6][1] = "
[6][2] = Just__Kidding__
[6][3] = "
[6][4] = Chaw Chin Fong
[6][5] = "
[6][6] = true

[7][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="713605605638938624"
        data-screen-name="Fruitcentre" data-name="Fruit &amp; Veg Centre" data-protected="false">
[7][1] = "
[7][2] = Fruitcentre
[7][3] = "
[7][4] = Fruit &amp; Veg Centre
[7][5] = "
[7][6] = false

[8][0] = <div class="user-actions btn-group not-following not-muting protected" data-user-id="555968644"
        data-screen-name="aeronhalecastle" data-name="Eywon ツ" data-protected="true">
[8][1] = "
[8][2] = aeronhalecastle
[8][3] = "
[8][4] = Eywon ツ
[8][5] = "
[8][6] = true

[9][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="2845398050"
        data-screen-name="Deheyb" data-name="4k Scrub✌️" data-protected="false">
[9][1] = "
[9][2] = Deheyb
[9][3] = "
[9][4] = 4k Scrub✌️
[9][5] = "
[9][6] = false

[10][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="721815663216566272"
        data-screen-name="Ribbon2712" data-name="Даниил Демидов" data-protected="false">
[10][1] = "
[10][2] = Ribbon2712
[10][3] = "
[10][4] = Даниил Демидов
[10][5] = "
[10][6] = false

[11][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="3248438456"
        data-screen-name="zayarmgmg95" data-name="Zayar Mg" data-protected="false">
[11][1] = "
[11][2] = zayarmgmg95
[11][3] = "
[11][4] = Zayar Mg
[11][5] = "
[11][6] = false

[12][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726440286063198208"
        data-screen-name="Ninderpy" data-name="Derpy" data-protected="false">
[12][1] = "
[12][2] = Ninderpy
[12][3] = "
[12][4] = Derpy
[12][5] = "
[12][6] = false

[13][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="423763655"
        data-screen-name="ImJoehuff" data-name="JoeyT" data-protected="false">
[13][1] = "
[13][2] = ImJoehuff
[13][3] = "
[13][4] = JoeyT
[13][5] = "
[13][6] = false

[14][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726441786839703556"
        data-screen-name="zxmir_" data-name="Zxmir_" data-protected="false">
[14][1] = "
[14][2] = zxmir_
[14][3] = "
[14][4] = Zxmir_
[14][5] = "
[14][6] = false

[15][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="726440845713367041"
        data-screen-name="hienlequang" data-name="Hiền Lê Quang" data-protected="false">
[15][1] = "
[15][2] = hienlequang
[15][3] = "
[15][4] = Hiền Lê Quang
[15][5] = "
[15][6] = false

[16][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="3032113115"
        data-screen-name="Najer14" data-name="Jan" data-protected="false">
[16][1] = "
[16][2] = Najer14
[16][3] = "
[16][4] = Jan
[16][5] = "
[16][6] = false

[17][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="4762819022"
        data-screen-name="7forOne" data-name="Abiel" data-protected="false">
[17][1] = "
[17][2] = 7forOne
[17][3] = "
[17][4] = Abiel
[17][5] = "
[17][6] = false

[18][0] = <div class="user-actions btn-group not-following not-muting " data-user-id="717061680799330306"
        data-screen-name="Th3uN1qu31" data-name="Th3_uN1Qu3" data-protected="false">
[18][1] = "
[18][2] = Th3uN1qu31
[18][3] = "
[18][4] = Th3_uN1Qu3
[18][5] = "
[18][6] = false

Explanation


NODE                     EXPLANATION
----------------------------------------------------------------------
  <div                     '<div'
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    class=                   'class='
----------------------------------------------------------------------
    ['"]?                    any character of: ''', '"' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    user-actions             'user-actions'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    data-screen-name=        'data-screen-name='
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      ['"]?                    any character of: ''', '"' (optional
                               (matching the most amount possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    (                        group and capture to \2:
----------------------------------------------------------------------
      .*?                      any character (0 or more times
                               (matching the least amount possible))
----------------------------------------------------------------------
    )                        end of \2
----------------------------------------------------------------------
    \1                       what was matched by capture \1
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    data-name=               'data-name='
----------------------------------------------------------------------
    (                        group and capture to \3:
----------------------------------------------------------------------
      ['"]?                    any character of: ''', '"' (optional
                               (matching the most amount possible))
----------------------------------------------------------------------
    )                        end of \3
----------------------------------------------------------------------
    (                        group and capture to \4:
----------------------------------------------------------------------
      .*?                      any character (0 or more times
                               (matching the least amount possible))
----------------------------------------------------------------------
    )                        end of \4
----------------------------------------------------------------------
    \3                       what was matched by capture \3
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    data-protected=          'data-protected='
----------------------------------------------------------------------
    (                        group and capture to \5:
----------------------------------------------------------------------
      ['"]?                    any character of: ''', '"' (optional
                               (matching the most amount possible))
----------------------------------------------------------------------
    )                        end of \5
----------------------------------------------------------------------
    (                        group and capture to \6:
----------------------------------------------------------------------
      .*?                      any character (0 or more times
                               (matching the least amount possible))
----------------------------------------------------------------------
    )                        end of \6
----------------------------------------------------------------------
    \5                       what was matched by capture \5
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^']                     any character except: '''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^"]                     any character except: '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \\                       '\'
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  >                        '>'