Matching multiple instances of complex regex non-capturing groups that appear in an unpredictable order

Question

TL; DR

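I'm parsing a report with C# regex, with MultiLine enabled, processing the whole file with a single (complex) regex pattern that uses named groups (and CaptureCollection).

Sections of my report appear out of order, or are missing, in ways I can't predict.

How do I match them regardless of the order they appear in?

Preface

I'm parsing a report in C# (.NET 3.5) with regular expressions, using System.Text.RegularExpressions. Part of the report looks like this: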

     Section Z              0 __ base 10
                            2 __ 19/04 20:06:39
                            2 __ 19/04 20:15:49
                          1.8 __ 19/04 20:09:35
                          1.6 __ 19/04 20:07:01
                          1.6 __ 19/04 20:08:29
     Section 7            0.8 __ base 10
                            8 __ 18/04 21:03:01
                          7.3 __ 18/04 21:02:17
                          3.7 __ 19/04 08:41:09
                          3.4 __ 19/04 00:13:08
                          3.3 __ 18/04 21:02:50
     Section C              0 __ base 10
                         19.7 __ 19/04 10:25:06
                         11.1 __ 19/04 10:15:01
                          8.8 __ 19/04 10:14:50
                          7.2 __ 19/04 19:51:37
                          6.1 __ 19/04 14:19:47

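My regex uses the options (?mx) (MultiLine, IgnorePatternWhitespace) and is matched against the text file as a whole. Because the statistics part contains sub-statistics for each section, I hand-craft an optional (?) non-capturing group ((?:match_this_text)) for each section and place the groups in the pattern in the order I believe they occur, as follows: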

(?mx) #Turn on options multiline, ignore whitespace.
(?: # base 10 statistic sections
    (?:
        [\s-[\n\r]]*(?i:Section\sZ)\s+(?<base10_SectionZ>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_SectionZ_instance>\d+\.\d|\d+)\s__\s(?<base10_SectionZ_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )?
    (?:
        [\s-[\n\r]]*(?i:Section\s7)\s+(?<base10_Section7>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_Section7_instance>\d+\.\d|\d+)\s__\s(?<base10_Section7_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )?
    (?:
        [\s-[\n\r]]*(?i:Section\sC)\s+(?<base10_SectionC>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_SectionC_instance>\d+\.\d|\d+)\s__\s(?<base10_SectionC_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )?
)

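In each section's non-capturing group, the first line matches the "section header", the second line matches the line break(s) between the header and the statistic instances, and the third line matches the individual statistic instances (repeated for n instances).

The problem

The program that generates this report outputs the sections (e.g. Section Z, Section 7, Section C) in a different order depending on which version is running, and in some cases some sections are missing. When I ran my pattern against a second test file, it failed because the sections were out of order.

So Section C may appear before Section Z, but the regex pattern expects Z to appear before C.

Basically, I want the same regex to match and extract the same data (using the named groups above) regardless of the order of the sections, so that it matches both the test data above and this test data: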

     Section 7            0.8 __ base 10
                            8 __ 18/04 21:03:01
                          7.3 __ 18/04 21:02:17
                          3.7 __ 19/04 08:41:09
                          3.4 __ 19/04 00:13:08
                          3.3 __ 18/04 21:02:50
     Section C              0 __ base 10
                         19.7 __ 19/04 10:25:06
                         11.1 __ 19/04 10:15:01
                          8.8 __ 19/04 10:14:50
                          7.2 __ 19/04 19:51:37
                          6.1 __ 19/04 14:19:47
     Section Z              0 __ base 10
                            2 __ 19/04 20:06:39
                            2 __ 19/04 20:15:49
                          1.8 __ 19/04 20:09:35
                          1.6 __ 19/04 20:07:01
                          1.6 __ 19/04 20:08:29

Answer 1

Do you just want to capture each section?

Wouldn't this work? (Section ..*(?:\r.*){0,5})

http://regexr.com?30nfd

Answer 2

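I think in this case several separate regexes would probably serve you better than one giant regex. I would call File.ReadAllLines and then loop over the lines, checking each one with String.Contains("Section"). If the line contains "Section", create a new section object and run a section regex to populate it (section name and section data). If it does not contain "Section", run a different regex for the remaining section data and append the result to the current section object.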
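The line-by-line approach described above could be sketched roughly as follows. This is only an illustration: the Section class, the LineParser name and both patterns are assumptions made for the example, not code from the question, and the header regex stands in for the String.Contains("Section") check.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one parsed section.
public class Section
{
    public string Name;
    public string Value;
    // Each entry is (instance value, timestamp).
    public List<KeyValuePair<string, string>> Instances =
        new List<KeyValuePair<string, string>>();
}

public static class LineParser
{
    // Header line, e.g. "Section 7   0.8 __ base 10".
    static readonly Regex Header = new Regex(
        @"^\s*Section\s(?<name>\S+)\s+(?<value>\d+(?:\.\d+)?)\s__\sbase\s\d+\s*$",
        RegexOptions.IgnoreCase);

    // Instance line, e.g. "7.3 __ 18/04 21:02:17".
    static readonly Regex Instance = new Regex(
        @"^\s*(?<value>\d+(?:\.\d+)?)\s__\s(?<time>\d\d/\d\d\s\d\d:\d\d:\d\d)\s*$");

    public static List<Section> Parse(IEnumerable<string> lines)
    {
        var sections = new List<Section>();
        Section current = null;
        foreach (string line in lines)
        {
            Match h = Header.Match(line);
            if (h.Success)
            {
                // A header starts a new section, whatever order it arrives in.
                current = new Section();
                current.Name = h.Groups["name"].Value;
                current.Value = h.Groups["value"].Value;
                sections.Add(current);
                continue;
            }
            Match i = Instance.Match(line);
            if (i.Success && current != null)
            {
                // Any other data line is attached to the most recent section.
                current.Instances.Add(new KeyValuePair<string, string>(
                    i.Groups["value"].Value, i.Groups["time"].Value));
            }
        }
        return sections;
    }
}
```

It would be called as LineParser.Parse(File.ReadAllLines(path)), where path is whatever report file is being read. Because each header simply opens a new section, the order of the sections and any missing ones stop mattering to the patterns.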

Answer 3

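You may want to use the \G anchor to tie each match to the end of the previous one, so that you can still make sure there is nothing unwanted between the sections.

You can then use a more general expression for the sections: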

(?mx) #Turn on options multiline, ignore whitespace.
\G
(?: # base 10 statistic sections
    (?:
        [\s-[\n\r]]*(?i:Section\s(Z|7|C))\s+(?<base10_Section>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<base10_Section_instance>\d+\.\d|\d+)\s__\s(?<base10_Section_instance_time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
    )
)

Then verify whether any section is repeated or missing.
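A minimal sketch of that idea, assuming the generic pattern above with the unnamed (Z|7|C) group given the name "name" so it can be read back. The AnchoredSections class and CountSections method are hypothetical names for this example. It counts how often each expected section was matched, so the caller can spot repeats (count above 1) and gaps (count of 0).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchoredSections
{
    // Same shape as the generic pattern, with \G so that each match must
    // begin exactly where the previous one ended.
    const string Pattern = @"(?x)
        \G
        [\s-[\n\r]]*(?i:Section\s(?<name>Z|7|C))\s+(?<base10>\d+\.\d|\d+)\s__\sbase\s10
        (?:\r?\n)+
        (?:\s+(?<instance>\d+\.\d|\d+)\s__\s(?<time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+";

    // Returns how many times each expected section appeared in the
    // contiguous run of sections starting at the beginning of the input.
    public static Dictionary<string, int> CountSections(string input)
    {
        var counts = new Dictionary<string, int>();
        foreach (string n in new string[] { "Z", "7", "C" })
            counts[n] = 0;

        foreach (Match m in Regex.Matches(input, Pattern))
            counts[m.Groups["name"].Value.ToUpperInvariant()]++;

        return counts;
    }
}
```

Note that \G makes the chain stop at the first text that is not a recognised section, so the input is assumed to start directly at the first section header.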

Answer 4

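You shouldn't give the regex engine the option of matching nothing. It will go around finding a lot of "nothing" before it finds the optional parts.

Edit

If you just want a block match (any order, but contiguous), something like this will do.
The way you have it now, modified: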

(?:
   (?: Section ...  (?<sec_7> 7)
   )
 | (?: Section ...  (?<sec_C> C)
   )
 | (?: Section ...  (?<sec_Z> Z)
   )
)
(?: Section ...  (?!\k<sec_7>) (?<sec_7>  7) )?
(?: Section ...  (?!\k<sec_C>) (?<sec_C>  C) )?
(?: Section ...  (?!\k<sec_Z>) (?<sec_Z>  Z) )?

If the sections can be factored into a common form, then like this:

(?: Section ...  (?<sec_a>(?:7|C|Z)) )
(?: Section ...  (?<sec_b>(?!\k<sec_a>)(?:7|C|Z)) )?
(?: Section ...  (?<sec_c>(?!\k<sec_a>|\k<sec_b>)(?:7|C|Z)) )?
#
# Then after match check <sec_a/b/c> for its value
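To make the backreference trick concrete, here is a deliberately simplified sketch in which each "section" is reduced to a bare header line such as "Section C", and the elided "Section ..." part is not filled in. The DistinctBlock class and Order method are names invented for this example. The negative lookaheads (?!\k<sec_a>) and (?!\k<sec_a>|\k<sec_b>) refuse a section letter that an earlier group has already captured.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DistinctBlock
{
    // Up to three section headers in any order, each letter at most once.
    // In .NET a backreference to a group that has not captured never
    // matches, so the lookaheads are harmless when sec_b is unset.
    const string Pattern = @"(?x)
        Section\s (?<sec_a> 7|C|Z ) \n
        (?: Section\s (?<sec_b> (?!\k<sec_a>) (?:7|C|Z) ) \n )?
        (?: Section\s (?<sec_c> (?!\k<sec_a>|\k<sec_b>) (?:7|C|Z) ) \n )?";

    // Returns the captured section letters, concatenated in the order
    // they were found, or an empty string if nothing matched.
    public static string Order(string input)
    {
        Match m = Regex.Match(input, Pattern);
        if (!m.Success)
            return "";
        return m.Groups["sec_a"].Value
             + m.Groups["sec_b"].Value
             + m.Groups["sec_c"].Value;
    }
}
```

After the match, the values of sec_a, sec_b and sec_c tell you which section landed in which position. A repeated letter simply ends the block early.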

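If you don't care about matching the sections as one block:
Your case comes down to a plain OR condition, so it can be as simple as this: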

# base 10 statistic sections
    (?: ..)
  |
    (?: ..)
  |
    (?: ..)

Each match must then be checked in a while loop to see which of the 'base 10' sections it matched:

Match m = Regex.Match(input, regex, RegexOptions.IgnorePatternWhitespace);
while (m.Success)
{
   if (m.Groups["base10_Section7"].Success)  {    }
   else
   if (m.Groups["base10_SectionZ"].Success)  {    }
   else
   if (m.Groups["base10_SectionC"].Success)  {    }
   m = m.NextMatch();
}

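Even this can be reduced. For example, 7, Z and C can be combined into a single block. That leaves the OR (|) free to match other, different forms, such as 'base 2' or anything else.
One of the forms will match. Either way, a check has to be made.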

string input = @"
    Section Z              0 __ base 10
                           2 __ 19/04 20:06:39
                           2 __ 19/04 20:15:49
                         1.8 __ 19/04 20:09:35
                         1.6 __ 19/04 20:07:01
                         1.6 __ 19/04 20:08:29
    Section P           16.1 __ base 2
    Section 7            0.8 __ base 10
                           8 __ 18/04 21:03:01
                         7.3 __ 18/04 21:02:17
                         3.7 __ 19/04 08:41:09
                         3.4 __ 19/04 00:13:08
                         3.3 __ 18/04 21:02:50
    Section C              0 __ base 10
                        19.7 __ 19/04 10:25:06
                        11.1 __ 19/04 10:15:01
                         8.8 __ 19/04 10:14:50
                         7.2 __ 19/04 19:51:37
                         6.1 __ 19/04 14:19:47
    Section r           49.2 __ Base 2
";

string regex = @"
   # base 10 statistic sections
       (?:
         [\s-[\n\r]]*(?i:Section\s(?<base10_Section>Z|7|C)\s+(?<Base10>\d+\.\d|\d+)\s__\sbase)\s10
         (?:\r?\n)+
         (?:\s+(?<Instance>\d+\.\d|\d+)\s__\s(?<Time>\d\d/\d\d\s\d\d:\d\d:\d\d)(?:\r?\n)+)+
       )
     |  # Or, base 2 statistic sections
       (?:
         [\s-[\n\r]]*(?i:Section\s(?<base2_Section>R|P)\s+(?<Base2>\d+\.\d|\d+)\s__\sbase)\s2
         (?:\r?\n)+
       )
   # |  Or, something else

";

Match m = Regex.Match(input, regex, RegexOptions.IgnorePatternWhitespace);
int matchCount = 0;
while (m.Success)
{
    Console.WriteLine("\nMatch " + (++matchCount) + "\n------------------");
    // Check base 10
    if (m.Groups["base10_Section"].Success)
    {
        Console.WriteLine("Section (base10)  '" + m.Groups["base10_Section"] + "'  =  '" + m.Groups["Base10"] + "'\n");

        int count = m.Groups["Instance"].Captures.Count;
        // Instance
        for (int j = 0; j < count; j++)
            System.Console.WriteLine("    Instance (" + j + ") =  '" + m.Groups["Instance"].Captures[j] + "' ");
        // Time
        for (int j = 0; j < count; j++)
            System.Console.WriteLine("    Time(" + j + ") =  '" + m.Groups["Time"].Captures[j] + "' ");
        // Combined ..
        for (int j = 0; j < count; j++)
            System.Console.WriteLine("    Instance,Time  (" + j + ") =  '" +
                                          m.Groups["Instance"].Captures[j] + "' __ '" +
                                          m.Groups["Time"].Captures[j] + "' ");
    }
    else
    // Check base 2
    if (m.Groups["base2_Section"].Success)
        Console.WriteLine("Section (base2)  '" + m.Groups["base2_Section"] + "'  =  '" + m.Groups["Base2"] + "'\n");

    m = m.NextMatch();
}
