Question

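I've defined a regular expression that parses a single HTML tag. I'm not parsing a whole DOM tree, just a single tag, so a regular expression seems like a good fit.

Suppose I have this tag: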

<input type="text" disabled value="Something" />

I've defined my regular expression to parse the tag as:

<(?<closing>/)?(?<tname>[a-z][a-z0-9]*)(?:\s+(?<aname>[a-z0-9-_:]+)(?:=(?<quote>['""])(?<avalue>[^'""<>]*)\k<quote>)?)*(?<selfclosing>\s*\/)?>

To make it more readable, let's break it down:

1  <
2      (?<closing>/)?
3      (?<tname>[a-z][a-z0-9]*)
4      (?:\s+
5          (?<aname>[a-z0-9-_:]+)
6          (?:=
7              (?<quote>['""])
8              (?<avalue>[^'""<>]*)
9              \k<quote>
10         )?
11     )*
12     (?<selfclosing>\s*\/)?
13 >

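Lines:

1 & 13 - tag start and end
2 & 12 - whether it's a closing or a self-closing tag
3 - captures the tag name
4-11 - captures all attributes defined on the tag (hence the * at the end)
5 - attribute name
6-10 - non-capturing group for the attribute value, if present (hence the ? at the end)
6 - matches the = sign
7 - determines which quote is used (single or double)
8 - captures the attribute value
9 - matches the same closing quote that was used to open the attribute value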

Problem

When I parse the input tag shown above, which has three attributes, I can easily access all the attribute names via:

match.Groups["aname"].Captures

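But I'd also like to pair them with their values. That's where the problem is, because the second attribute has no value.

How can I match up match.Groups["aname"].Captures with match.Groups["avalue"].Captures from my regular expression?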
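For illustration, here is a minimal C# sketch of the mismatch, using the regex and the tag from this question unchanged. The variable names (tagRegex, nameCount, valueCount) are only for this example. The value-less disabled attribute adds a capture to aname but none to avalue, so the two collections cannot be paired by index:

```csharp
using System;
using System.Text.RegularExpressions;

// Regex from the question as a C# verbatim string ("" is an escaped double quote).
var tagRegex = new Regex(
    @"<(?<closing>/)?(?<tname>[a-z][a-z0-9]*)(?:\s+(?<aname>[a-z0-9-_:]+)(?:=(?<quote>['""])(?<avalue>[^'""<>]*)\k<quote>)?)*(?<selfclosing>\s*\/)?>");

Match match = tagRegex.Match("<input type=\"text\" disabled value=\"Something\" />");

// One capture per attribute name, but only one per attribute that has a value.
int nameCount = match.Groups["aname"].Captures.Count;
int valueCount = match.Groups["avalue"].Captures.Count;

Console.WriteLine($"names: {nameCount}, values: {valueCount}");
```

With this input the name count is one higher than the value count, which is exactly the situation the question asks about.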

Answer 1

Consider this:

<
(?<closing>/)?
(?<tname>[a-z][a-z0-9]*)
(?:
  \s+
  (?<aname>[a-z0-9-_:]+)
  (?:
    =?
    (?<quote>['"]?)
    (?<avalue>[^'"<>]*)
    \k<quote>
  )
)*
(?<selfclosing>\s*\/)?
>

It will also match some invalid tags:

<input type="text" disabled"" value="Something" />
<input type="text" disabled= value="Something" />

But you can fix that by adding lookaheads:

<
(?<closing>/)?
(?<tname>[a-z][a-z0-9]*)
(?:
  \s+
  (?<aname>[a-z0-9-_:]+)
  (?:
    (?:
      =
      (?=\S)|
      (?=\s)
    )
    (?<quote>['"]?)
    (?<avalue>[^'"<>]*)
    \k<quote>
  )
)*
(?<selfclosing>\s*\/)?
>

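The aname and avalue captures will then line up one-to-one: the value group is no longer optional, so every attribute produces an avalue capture (an empty one when the attribute has no value).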
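As a rough sketch of how the aligned captures might be consumed, the snippet below runs the lookahead pattern from this answer against the tag from the question and zips the two capture collections. The option flags (IgnorePatternWhitespace, IgnoreCase) and the variable names are assumptions made for this example, not something the answer specifies:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Lookahead pattern from this answer; the layout requires IgnorePatternWhitespace.
var tagRegex = new Regex(@"
<
(?<closing>/)?
(?<tname>[a-z][a-z0-9]*)
(?:
  \s+
  (?<aname>[a-z0-9-_:]+)
  (?:
    (?:
      =
      (?=\S)|
      (?=\s)
    )
    (?<quote>['""]?)
    (?<avalue>[^'""<>]*)
    \k<quote>
  )
)*
(?<selfclosing>\s*\/)?
>", RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

Match match = tagRegex.Match("<input type=\"text\" disabled value=\"Something\" />");

// Both groups capture once per attribute, so pairing by position is safe here.
var pairs = match.Groups["aname"].Captures.Cast<Capture>()
    .Zip(match.Groups["avalue"].Captures.Cast<Capture>(),
         (n, v) => (Name: n.Value, Value: v.Value))
    .ToList();

foreach (var (name, value) in pairs)
    Console.WriteLine($"{name} = \"{value}\"");
```

One trade-off of this approach is that a value-less attribute and an attribute with an empty quoted value both come out as an empty string.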

Answer 2

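Final solution

For future reference, I'm adding the code I ended up implementing. The regular expression is more complex than the one in the question, but it correctly parses tags (opening and closing) in an HTML string. It uses some positive lookbehinds, because capture groups can't be used in the conditionals: they are part of a quantified block. What I mean is that I could write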

(?<hasvalue>=)?

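which would correctly detect whether an attribute has a value assigned to it. The problem is that this named capture group sits inside the attribute named capture group, which has the quantifier * (zero or more attributes). The trouble with hasvalue shows up when we start parsing a state attribute that follows a valued one: the way the regex processor works, it would still see the successful hasvalue state left over from the previous attribute. That's why I'd rather use lookbehinds in the conditionals, because they don't capture and the processor doesn't keep their state in any form.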
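A small sketch of this effect, assuming .NET's regex engine. The two patterns and the input string "a=1 b" are simplified stand-ins invented for this illustration; they are not the final regex:

```csharp
using System;
using System.Text.RegularExpressions;

// Conditional on the capture group: once hasvalue has captured for one attribute,
// the test (?(hasvalue)...) still succeeds for the attributes that follow it.
var byGroup = new Regex(
    @"^(?:(?<name>\w+)(?<hasvalue>=)?(?(hasvalue)(?<value>\d+))\s*)*$");

// Conditional on a lookbehind: it only inspects the character just consumed,
// so nothing carries over from one attribute to the next.
var byLookbehind = new Regex(
    @"^(?:(?<name>\w+)(?<hasvalue>=?)(?(?<==)(?<value>\d+))\s*)*$");

const string input = "a=1 b"; // a valued attribute followed by a state attribute

bool groupVersionMatches = byGroup.IsMatch(input);
bool lookbehindVersionMatches = byLookbehind.IsMatch(input);

Console.WriteLine($"group conditional:      {groupVersionMatches}");
Console.WriteLine($"lookbehind conditional: {lookbehindVersionMatches}");
```

On .NET I would expect the group-based pattern to reject this input, because it demands a value after b, and the lookbehind-based pattern to accept it.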

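Balancing groups could be used instead of lookbehinds, but that would make the regular expression .NET-specific rather than general(ish). That's why I chose lookbehinds, so it can be used in other environments with different regex processors. Processors that don't support named capture groups can still use this regular expression: just remove the group names and refer to the groups by index.

Important: note that the attribute name capture count does not match the attribute value capture count, which was my initial problem with the regex parsing. So instead of matching those two, I match attribute names with equals signs. That way I always know whether a given attribute is a state-only attribute (e.g. selected or disabled) or a valued attribute (e.g. value="42"). The code below shows how this information is used to pair attribute names with values.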

Regular expression

<                           # TAG START
(?<isclosing>\/\s*)?            # is this a closing tag
(?<tag>[\w:-]+)                 # capture tag name
(?<attribute>
    \s+                         # every attribute starts with at least one space
    (?<name>[\w:-]+)            # capture attribute name
    \s*                         # there may be arbitrary spaces between attribute name and '='
    (?<hasvalue>=?)             # is this a state or valued attribute
    (?(?<==)                    # if it's a valued attribute process its value
        \s*                     # there may be arbitrary spaces between '=' and attribute value
        (?<quote>['"]?)         # if value is quoted, capture its quote type
        (?<value>               # capture raw value without quotes if present
            (?(?<=')
                [^']*|          # parse single quoted attribute value
                (?(?<=")
                    [^"]*|      # parse double quoted attribute value
                    [\w]+       # parse unquoted attribute value
                )
            )
        )
        \k<quote>               # quotes must match each other
    )
)*
\s*                             # there may be some spaces before tag end
(?<selfclosing>\/)?             # is this a self-closing tag i.e. '<br/>'
>                           # TAG END

Parsing code

This is an excerpt of the actual code that parses tags from HTML text:

IList<KeyValuePair<string, string>> tagAttributes;
StringBuilder parsedText = new StringBuilder();

char character;
for (int index = 0, length = htmlContent.Length; index < length; index++)
{
    character = htmlContent[index];

    if (character == '<')
    {
        Match tagMatch = tagRegex.Match(htmlContent, index);

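        // prevent parsing of invalid HTML with incomplete tags like <tabl<p>This is a paragraph</p>
        // (Success is checked too: a failed match reports Index 0, which would pass the index test at position 0)
        if (!tagMatch.Success || tagMatch.Index != index)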
        {
            parsedText.Append("&lt;");
            continue;
        }

        // start every tag with an empty list so attributes of a previous tag don't leak into this one
        tagAttributes = new List<KeyValuePair<string, string>>();

        if (tagMatch.Groups["attribute"].Success)
        {

            // attribute count and their type counts (valued or state only) match
            var attributes = tagMatch
                .Groups["name"]
                .Captures
                .Cast<Capture>()
                .Select(c => c.Value)
                .Zip(
                    tagMatch
                        .Groups["hasvalue"]
                        .Captures
                        .Cast<Capture>()
                        .Select(c => c.Value == "="),
                    (name, isvalued) => new KeyValuePair<string, bool>(name, isvalued)
                );

            // attribute values count may be less than there are attributes
            IEnumerator<string> values = tagMatch
                .Groups["value"]
                .Captures
                .Cast<Capture>()
                .Select(c => c.Value)
                .GetEnumerator();

            foreach (var attribute in attributes)
            {
                tagAttributes.Add(
                    new KeyValuePair<string, string>(
                        attribute.Key,
                        attribute.Value && values.MoveNext() ?
                            values.Current :
                            null)
                );
            }
        }

        /* Do whatever you need with these
         *
         * TAG NAME <= tagMatch.Groups["tag"].Value;
         * IS CLOSING <= tagMatch.Groups["isclosing"].Success;
         * IS SELF CLOSING <= tagMatch.Groups["selfclosing"].Success;
         * ATTRIBUTES LIST <= tagAttributes
         */

        // advance content character index past currently regex matched tag definition
        index += tagMatch.Length - 1;
    }
    else
    {
        parsedText.Append(character);
    }
}

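This correctly parses tags whose attributes may be state-only or valued. Valued attributes may have unquoted, single-quoted, or double-quoted values. Quoted values may be empty strings.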
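To try this on a single tag, here is a self-contained sketch. It reuses the regular expression from this answer (with the inline comments dropped) and the same name/hasvalue pairing idea, written with an index instead of an enumerator. The helper name ReadAttributes and the sample tag are hypothetical and exist only for this example:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Regex from this answer; "" is an escaped double quote in a verbatim string.
var tagRegex = new Regex(@"
<
(?<isclosing>\/\s*)?
(?<tag>[\w:-]+)
(?<attribute>
    \s+
    (?<name>[\w:-]+)
    \s*
    (?<hasvalue>=?)
    (?(?<==)
        \s*
        (?<quote>['""]?)
        (?<value>
            (?(?<=')
                [^']*|
                (?(?<="")
                    [^""]*|
                    [\w]+
                )
            )
        )
        \k<quote>
    )
)*
\s*
(?<selfclosing>\/)?
>", RegexOptions.IgnorePatternWhitespace);

Match tagMatch = tagRegex.Match("<input type=\"text\" disabled value='' data-x=42 />");
List<KeyValuePair<string, string>> attributes = ReadAttributes(tagMatch);

foreach (var attribute in attributes)
    Console.WriteLine($"{attribute.Key} -> {(attribute.Value == null ? "(state only)" : "\"" + attribute.Value + "\"")}");

// Hypothetical helper: name and hasvalue capture once per attribute, value only
// for valued attributes, so a separate index walks the value captures.
static List<KeyValuePair<string, string>> ReadAttributes(Match match)
{
    var result = new List<KeyValuePair<string, string>>();
    CaptureCollection names = match.Groups["name"].Captures;
    CaptureCollection flags = match.Groups["hasvalue"].Captures;
    CaptureCollection values = match.Groups["value"].Captures;

    int valueIndex = 0;
    for (int i = 0; i < names.Count; i++)
    {
        bool isValued = flags[i].Value == "=";
        result.Add(new KeyValuePair<string, string>(
            names[i].Value,
            isValued ? values[valueIndex++].Value : null));
    }
    return result;
}
```

Unlike an approach that always captures a value, this one keeps a state-only attribute (null) distinguishable from an attribute with an empty quoted value (an empty string).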

How to match a regex group's captures collection with another group's captures collection?
