.NET Core regular expressions: named groups, nested groups, backreferences and lazy quantifiers

Time: 2018-11-18 20:42:34

Tags: c# .net regex .net-core

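I am trying to parse key/value pairs from strings that look like markup, using .NET Core 2.1.

Consider the sample Program.cs file below...

My questions are:

1. How do I write the pattern kvp so that it behaves as "key and value (if present)" rather than the current key or value?

For example, in the output for test case 2, instead of: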

=============================
input = <tag KEY1="vAl1">

--------------------
kvp[0] = KEY1
          key   =       KEY1
        value   =
--------------------
kvp[1] = vAl1
          key   =
        value   =       vAl1
=============================

I would like to see:

=============================
input = <tag KEY1="vAl1">

--------------------
kvp[0] = KEY1="vAl1"
          key   =       KEY1
        value   =       vAl1
=============================

without breaking test case 9:

=============================
input = <tag noValue1 noValue2>

--------------------
kvp[0] = noValue1
          key   =       noValue1
        value   =
--------------------
kvp[1] = noValue2
          key   =       noValue2
        value   =
=============================

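2. How do I write the pattern value so that it stops matching at the next character that matches the group named "quotes"? In other words, at the next balanced quote. I have clearly misunderstood how backreferences work: my understanding was that \k<quotes> would be replaced by the value matched at run time (not by the pattern defined at design time) for (?<quotes>[""'`]).

For example, in the output for test case 5, instead of: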

--------------------
kvp[4] =  key3='hello,
          key   =
        value   =        key3='hello,
--------------------
kvp[5] = experts
          key   =
        value   =       experts
=============================

I would like to see (notwithstanding a fix for question 1):

--------------------
kvp[4] =  key3
          key   =        key3
        value   =
--------------------
kvp[5] = hello, "experts"
          key   =
        value   =       hello, "experts"
=============================

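3. How do I write the pattern value so that it stops matching before />? In test case 7, the value of key2 should be thing-1. I can't remember everything I tried, but I have not found a pattern that works without breaking test case 6, where the / is part of the value.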


Program.cs

using System;
using System.Reflection;
using System.Text.RegularExpressions;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            RegExTest();

            Console.ReadLine();
        }

        static void RegExTest()
        {
            // Test Cases
            var case1 = @"<tag>";
            var case2 = @"<tag KEY1=""vAl1"">";
            var case3 = @"<tag kEy2='val2'>";
            var case4 = @"<tag key3=`VAL3`>";
            var case5 = @"<tag           key1='val1' 

                        key2=""http://www.w3.org"" key3='hello, ""experts""'>";
            var case6 = @"<tag :key1 =some/thing>";
            var case7 = @"<tag key2=thing-1/>";
            var case8 = @"<tag key3      =        thing-2>";
            var case9 = @"<tag noValue1 noValue2>";
            var case10 = @"<tag/>";
            var case11 = @"<tag />";

            // A key may begin with a letter, underscore or colon, followed by 
            // zero or more of those, or numbers, periods, or dashes.
            string key = @"(?<key>(?<=\s+)[a-z_:][a-z0-9_:\.-]*?(?=[\s=>]+))";

            // A value may contain any character, and must be wrapped in balanced quotes (double, single, 
            // or back) if the value contains any quote, whitespace, equal, or greater- or less- than 
            // character.
            string value = @"(?<value>((?<=(?<quotes>[""'`])).*?(?=\k<quotes>)|(?<=[=][\s]*)[^""'`\s=<>]+))";

            // A key-value pair must contain a key, 
            // a value is optional
            string kvp = $"(?<kvp>{key}|{value})"; // Without the | (pipe), it doesn't match any test case...

            // ...value needs to be optional (case9), tried:
            //kvp = $"(?<kvp>{key}{value}?)";
            //kvp = $"(?<kvp>{key}({value}?))";
            //kvp = $"(?<kvp>{key}({value})?)";
            // ...each only matches key, but also matches value in case8 as key

            Regex getKvps = new Regex(kvp, RegexOptions.IgnoreCase);

            FormatMatches(getKvps.Matches(case1)); // OK

            FormatMatches(getKvps.Matches(case2)); // OK

            FormatMatches(getKvps.Matches(case3)); // OK

            FormatMatches(getKvps.Matches(case4)); // OK

            FormatMatches(getKvps.Matches(case5)); // Backreference and/or lazy quantifier doesn't work.

            FormatMatches(getKvps.Matches(case6)); // OK

            FormatMatches(getKvps.Matches(case7)); // The / is not part of the value.

            FormatMatches(getKvps.Matches(case8)); // OK

            FormatMatches(getKvps.Matches(case9)); // OK

            FormatMatches(getKvps.Matches(case10)); // OK

            FormatMatches(getKvps.Matches(case11)); // OK
        }

        static void FormatMatches(MatchCollection matches)
        {
            Console.WriteLine(new string('=', 78));

            var _input = matches.GetType().GetField("_input",
                BindingFlags.NonPublic |
                BindingFlags.Instance)
                .GetValue(matches);

            Console.WriteLine($"input = {_input}");
            Console.WriteLine();

            if (matches.Count < 1)
            {
                Console.WriteLine("[kvp not matched]");
                return;
            }

            for (int i = 0; i < matches.Count; i++)
            {
                Console.WriteLine(new string('-', 20));

                Console.WriteLine($"kvp[{i}] = {matches[i].Groups["kvp"]}");
                Console.WriteLine($"\t  key\t=\t{matches[i].Groups["key"]}");
                Console.WriteLine($"\tvalue\t=\t{matches[i].Groups["value"]}");
            }
        }
    }
}

1 answer:

Answer 0 (score: 2)

You can use

\s(?<key>[a-z_:][a-z0-9_:.-]*)(?:\s*=\s*(?:(?<q>[`'"])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'"<>])+)))?

See the regex demo (with the groups highlighted) and the .NET regex demo (proof).

C# usage:

var pattern = @"\s(?<key>[a-z_:][a-z0-9_:.-]*)(?:\s*=\s*(?:(?<q>[`'""])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'""<>])+)))?";
var matches = Regex.Matches(input, pattern, RegexOptions.IgnoreCase); // input: any test-case string ("case" is a reserved word in C#)
foreach (Match m in matches)
{
    Console.WriteLine(m.Value);                 // The whole match
    Console.WriteLine(m.Groups["key"].Value);   // Group "key" value
    Console.WriteLine(m.Groups["value"].Value); // Group "value" value
}
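
To see the pattern applied to the question's test cases, here is a minimal self-contained sketch. The class name KvpDemo and the helper Parse are hypothetical names introduced for illustration, and the case 5 input is condensed to a single line. Checking Group.Success is one way to tell a key with no value (case 9) apart from a key with an empty quoted value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class KvpDemo
{
    // Pattern copied verbatim from the answer.
    const string Pattern =
        @"\s(?<key>[a-z_:][a-z0-9_:.-]*)(?:\s*=\s*(?:(?<q>[`'""])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'""<>])+)))?";

    // Hypothetical helper: one entry per key; Value is null when the key has no value.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(input, Pattern, RegexOptions.IgnoreCase))
        {
            Group value = m.Groups["value"];
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups["key"].Value,
                value.Success ? value.Value : null));
        }
        return pairs;
    }

    static void Main()
    {
        string[] inputs =
        {
            @"<tag KEY1=""vAl1"">",                                                  // case 2
            @"<tag key1='val1' key2=""http://www.w3.org"" key3='hello, ""experts""'>", // case 5, on one line
            @"<tag :key1 =some/thing>",                                              // case 6
            @"<tag key2=thing-1/>",                                                  // case 7
            @"<tag noValue1 noValue2>",                                              // case 9
        };

        foreach (string input in inputs)
        {
            Console.WriteLine($"input = {input}");
            foreach (var pair in Parse(input))
                Console.WriteLine($"\tkey = {pair.Key}\tvalue = {pair.Value ?? "(none)"}");
        }
    }
}
```

Because the whole key/value pair is consumed by one match, a closing quote can no longer be mistaken for the opening quote of the next value, which is what appears to happen with the lookbehind-based pattern in the question.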

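Details

  • \s - a whitespace character
  • (?<key>[a-z_:][a-z0-9_:.-]*) - group "key": a letter, _ or :, followed by 0+ letters, digits, _, :, . or -
  • (?:\s*=\s*(?:(?<q>[`'"])(?<value>.*?)\k<q>|(?<value>(?:(?!/>)[^\s`'"<>])+)))? - an optional non-capturing group, one or zero occurrences (so the value is optional):
    • \s*=\s* - a = surrounded by 0+ whitespace characters
    • (?: - start of a non-capturing group:
      • (?<q>[`'"]) - group "q": the delimiter, one of `, ' or "
      • (?<value>.*?) - group "value": any 0+ characters other than a newline, as few as possible
      • \k<q> - backreference to group "q": the same character must match again
    • | - or
      • (?<value>(?:(?!/>)[^\s`'"<>])+) - group "value": a character other than whitespace, `, ', ", < and >, 1 or more occurrences, none of which starts a /> character sequence
    • ) - end of the non-capturing group.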
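
The two mechanisms in the list above can also be tried in isolation. The sketch below is illustrative only: the class MechanismDemo and the methods Quoted and Unquoted are hypothetical names, and each method uses just one branch of the answer's alternation.

```csharp
using System;
using System.Text.RegularExpressions;

static class MechanismDemo
{
    // Quoted branch: \k<q> must repeat the exact character captured by group "q",
    // so a value opened with ' may contain " and ` and only ends at the next '.
    public static string Quoted(string input) =>
        Regex.Match(input, @"(?<q>[`'""])(?<value>.*?)\k<q>").Groups["value"].Value;

    // Unquoted branch: (?!/>) is tested before every character that is consumed,
    // so the run stops before "/>" while a lone "/" is still accepted.
    public static string Unquoted(string input) =>
        Regex.Match(input, @"(?<value>(?:(?!/>)[^\s`'""<>])+)").Groups["value"].Value;

    static void Main()
    {
        Console.WriteLine(Quoted(@"'hello, ""experts""'")); // prints: hello, "experts"
        Console.WriteLine(Unquoted("thing-1/>"));             // prints: thing-1
        Console.WriteLine(Unquoted("some/thing>"));           // prints: some/thing
    }
}
```

Placing the negative lookahead inside the repeated group, rather than once in front of it, is what lets test case 6 keep its / while test case 7 drops the trailing />.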