Regex help: my regex pattern will match invalid Dictionary

Asked: 2011-04-01 13:40:21

Tags: c# .net regex linq dictionary

I hope you can help me. I am using C# on .NET 4.0.

I want to validate a file structure like this:

 
const string dataFileScr = @"
Start 0
{
    Next = 1
    Author = rk
    Date = 2011-03-10
/*  Description = simple */
}

PZ 11
{
IA_return()
}

GDC 7
{
    Message = 6
    Message = 7
        Message = 8
        Message = 8
    RepeatCount = 2
    ErrorMessage = 10
    ErrorMessage = 11
    onKey[5] = 6
    onKey[6] = 4
    onKey[9] = 11
}
";

So far I have managed to build this regex pattern:

 
const string patternFileScr = @"
^                           
((?:\[|\s)*                  

     (?<Section>[^\]\r\n]*)     
 (?:\])*                     
 (?:[\r\n]{0,}|\Z))         
(
    (?:\{)                  ### !! improve for .ini file, dont take { 
    (?:[\r\n]{0,}|\Z)           
        (                          # Begin capture groups (Key Value Pairs)
        (?!\}|\[)                    # Stop capture groups if a } is found; new section  

          (?:\s)*                     # Line with space
          (?<Key>[^=]*?)            # Any text before the =, matched few as possible
          (?:[\s]*=[\s]*)                     # Get the = now
          (?<Value>[^\r\n]*)        # Get everything that is not an Line Changes


         (?:[\r\n]{0,})
         )*                        # End Capture groups
    (?:[\r\n]{0,})
    (?:\})?
    (?:[\r\n\s]{0,}|\Z)
)*

                ";

and this C#:


  Dictionary<string, Dictionary<string, string>> DictDataFileScr
            = (from Match m in Regex.Matches(dataFileScr,
                                            patternFileScr,
                                            RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline)
               select new
               {
                   Section = m.Groups["Section"].Value,

                   // Pair the i-th Key capture with the i-th Value capture.
                   kvps = (from cpKey in m.Groups["Key"].Captures.Cast<Capture>().Select((a, i) => new { a.Value, i })
                           join cpValue in m.Groups["Value"].Captures.Cast<Capture>().Select((b, i) => new { b.Value, i }) on cpKey.i equals cpValue.i
                           select new KeyValuePair<string, string>(cpKey.Value, cpValue.Value)).OrderBy(_ => _.Key)
                           .ToDictionary(kvp => kvp.Key, kvp => kvp.Value)

               }).ToDictionary(itm => itm.Section, itm => itm.kvps);

It works for:

 
const string dataFileScr = @"
Start 0
{
    Next = 1
    Author = rk
    Date = 2011-03-10
/*  Description = simple */
}

GDC 7
{
    Message = 6
    RepeatCount = 2
    ErrorMessage = 10
    onKey[5] = 6
    onKey[6] = 4
    onKey[9] = 11
}
";

In other words, for input of this shape:

 
Section1
{
key1=value1
key2=value2
}

Section2
{
key1=value1
key2=value2
}

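but it fails in three cases:

  1. It does not handle repeated key names. I want the output grouped by key:

    DictDataFileScr["GDC 7"]["Message"] = "6|7|8|8"
    DictDataFileScr["GDC 7"]["ErrorMessage"] = "10|11"

  2. It does not work for .ini files such as:

    ....
    [Section1]
    key1 = value1
    key2 = value2

    [Section2]
    key1 = value1
    key2 = value2
    ...

  3. It does not see the next section after a block like this:

    ....
    PZ 11
    {
    IA_return()
    }
    .....
    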

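    4 Answers:

    Answer 0 (score: 2)

    Do yourself and your sanity a favor and learn how to use GPLex and GPPG. They are the closest thing C# has to Lex and Yacc (or Flex and Bison, if you prefer), and they are the right tools for this job.

    Regular expressions are great tools for robust string matching, but when you need to match the structure of strings, you need a "grammar". That is what a parser is for. GPLex takes a bunch of regular expressions and generates a very fast lexer. GPPG takes the grammar you write and generates a very fast parser.

    Trust me, learn how to use these tools, or any other tools like them. You'll be glad you did.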
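
To make the "parse the structure instead of matching it" idea concrete without bringing in a generator, here is a minimal hand-written, line-by-line parser. It is only a sketch and not GPLex/GPPG output: `SectionParser` and `Parse` are hypothetical names, it assumes comments occupy whole lines, that `{` and `}` sit on their own lines, and that repeated keys should be joined with `|` as the question asks.

```csharp
using System;
using System.Collections.Generic;

public static class SectionParser
{
    // Hypothetical helper: returns section name -> (key -> "v1|v2|...").
    public static Dictionary<string, Dictionary<string, string>> Parse(string text)
    {
        var result = new Dictionary<string, Dictionary<string, string>>();
        Dictionary<string, string> current = null; // section being filled, if any
        bool inBraces = false;                     // true between "{" and "}"
        bool inComment = false;                    // true inside a multi-line /* ... */

        foreach (string raw in text.Split('\n'))
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;

            // Assumption: comments start at the beginning of a line.
            if (inComment || line.StartsWith("/*", StringComparison.Ordinal))
            {
                inComment = !line.EndsWith("*/", StringComparison.Ordinal);
                continue;
            }

            if (line == "{") { inBraces = true; continue; }
            if (line == "}") { inBraces = false; current = null; continue; }

            // INI-style header: [Section]
            if (line[0] == '[' && line[line.Length - 1] == ']')
            {
                string name = line.Substring(1, line.Length - 2).Trim();
                current = result[name] = new Dictionary<string, string>();
                inBraces = false;
                continue;
            }

            int eq = line.IndexOf('=');
            if (eq > 0)
            {
                if (current == null) continue; // stray pair outside any section
                string key = line.Substring(0, eq).Trim();
                string value = line.Substring(eq + 1).Trim();
                string old;
                // Repeated keys are appended with '|'.
                current[key] = current.TryGetValue(key, out old) ? old + "|" + value : value;
            }
            else if (!inBraces)
            {
                // Any other line outside braces is taken as a brace-style header, e.g. "GDC 7".
                current = result[line] = new Dictionary<string, string>();
            }
            // Other lines inside braces (e.g. "IA_return()") are ignored in this sketch.
        }
        return result;
    }
}
```

With the question's sample input, `SectionParser.Parse(dataFileScr)["GDC 7"]["Message"]` would hold `6|7|8|8`, and `PZ 11` would appear as a section with no key/value pairs. A generated lexer/parser would replace the ad hoc line checks with real tokens and grammar rules.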

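    Answer 1 (score: 2)

    Here is a complete rework of the regex in C#.

    Assumptions (tell me if any or all of them are wrong):

    1. An INI file section can only contain key/value pair lines in its body
    2. In a non-INI file section, function calls cannot take any parameters
    3. Regex flags:
      RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.Singleline


      Input test: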

      
      const string dataFileScr = @"
      Start 0
      {
          Next = 1
          Author = rk
          Date = 2011-03-10
      /*  Description = simple */
      }
      
      PZ 11
      {
      IA_return()
      }
      
      GDC 7
      {
          Message = 6
          Message = 7
              Message = 8
              Message = 8
          RepeatCount = 2
          ErrorMessage = 10
          ErrorMessage = 11
          onKey[5] = 6
          onKey[6] = 4
          onKey[9] = 11
      }
      
      [Section1]
      key1 = value1
      key2 = value2
      
      [Section2]
      key1 = value1
      key2 = value2
      ";
      

      Rewritten regex:

      
      const string patternFileScr = @"
      (?<Section>                                                              (?# Start of a non ini file section)
        (?<SectionName>[\w ]+)\s*                                              (?# Capture section name)
           {                                                                   (?# Match but don't capture beginning of section)
              (?<SectionBody>                                                  (?# Capture section body. Section body can be empty)
               (?<SectionLine>\s*                                              (?# Capture zero or more line(s) in the section body)
               (?:                                                             (?# A line can be either a key/value pair, a comment or a function call)
                  (?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]*))    (?# Capture key/value pair. Key and value are sub-captured separately)
                  |
                  (?<Comment>/\*.+?\*/)                                        (?# Capture comment)
                  |
                  (?<FunctionCall>[\w]+\(\))                                   (?# Capture function call. A function can't have parameters though)
               )\s*                                                            (?# Match but don't capture white characters)
               )*                                                              (?# Zero or more line(s), previously mentioned in comments)
              )
           }                                                                   (?# Match but don't capture end of section)
      )
      |
      (?<Section>                                                              (?# Start of an ini file section)
        \[(?<SectionName>[\w ]+)\]                                             (?# Capture section name)
        (?<SectionBody>                                                        (?# Capture section body. Section body can be empty)
           (?<SectionLine>                                                     (?# Capture zero or more line(s) in the section body. Only key/value pair allowed.)
              \s*(?<KeyValuePair>(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]+))\s*  (?# Capture key/value pair. Key and value are sub-captured separately)
           )*                                                                  (?# Zero or more line(s), previously mentioned in comments)
        )
      )
      ";
      

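      Discussion: The regex is built to match either a non-INI file section (1) or an INI file section (2).

      (1) Non-INI file sections: These sections consist of a section name followed by a body enclosed in { and }. The section name can contain letters, digits, or spaces. The section body consists of zero or more lines. A line can be a key/value pair (key = value), a comment (/* this is a comment */), or a function call without parameters (my_function()).

      (2) INI file sections: These sections consist of a section name enclosed in [ and ], followed by zero or more key/value pairs. Each pair is on its own line.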
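
The grouped output asked for in point 1 of the question (`"6|7|8|8"`) can be produced from a section body with a `GroupBy` over the key/value matches. The sketch below assumes the body text of one section is already available as a string (for example from a `SectionBody` capture); `KeyGrouping` and `GroupKeys` are hypothetical names, and the key/value sub-pattern is borrowed from the regex above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyGrouping
{
    // Hypothetical helper: collapses repeated keys of one section body
    // into a single "v1|v2|..." string per key.
    public static Dictionary<string, string> GroupKeys(string sectionBody)
    {
        // Drop /* ... */ comments first so commented-out pairs are not picked up.
        string body = Regex.Replace(sectionBody, @"/\*.*?\*/", "", RegexOptions.Singleline);

        return Regex.Matches(body, @"(?<Key>[\w\[\]]+)\s*=\s*(?<Value>[\w-]+)")
            .Cast<Match>()
            // Group values by key, keeping them in order of appearance.
            .GroupBy(m => m.Groups["Key"].Value, m => m.Groups["Value"].Value)
            .ToDictionary(g => g.Key, g => string.Join("|", g));
    }
}
```

Called on the body of the `GDC 7` section, `GroupKeys(body)["Message"]` would be `6|7|8|8` and `GroupKeys(body)["ErrorMessage"]` would be `10|11`.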

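    Answer 2 (score: 0)

    #2. Does not work for .ini files

    It does not work because your regex states that a { is required after the [Section] header. Your regex would match if you had something like this: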

    [Section]
    {
    key = value
    }
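
One way around this is to recognise the section header on its own, without requiring a `{` to follow it. The sketch below is an assumption about how that could look, not the asker's pattern: `HeaderMatcher` and `SectionNames` are hypothetical names, and a bare line made only of word characters and spaces is treated as a brace-style header.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeaderMatcher
{
    // A header is either "[Name]" or a bare "Name" line (word characters and spaces only).
    // .NET allows the same group name in both alternatives.
    static readonly Regex Header = new Regex(
        @"^[ \t]*(?:\[(?<Name>[^\]\r\n]+)\]|(?<Name>\w[\w ]*?))[ \t]*\r?$",
        RegexOptions.Multiline);

    // Hypothetical helper: lists the section names found in the text.
    public static List<string> SectionNames(string text)
    {
        return Header.Matches(text).Cast<Match>()
                     .Select(m => m.Groups["Name"].Value.Trim())
                     .ToList();
    }
}
```

Key/value lines, braces, and calls such as `IA_return()` do not match because they contain characters outside the allowed set, so both `[Section1]` and `GDC 7` are found as headers regardless of what follows them.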
    

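    Answer 3 (score: 0)

    Here is a sample in Perl. Perl does not have named capture arrays, probably because of backtracking. Maybe you can pick something out of the regex, though. This assumes there is no nesting of {} brackets.

    Edit: Never content to leave well enough alone, a revised version is below.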

    use strict;
    use warnings;
    
    my $str = '
    Start 0
    {
        Next = 1
        Author = rk
        Date = 2011-03-10
     /*  Description = simple
     */
    }
    
    asdfasdf
    
    PZ 11
    {
    IA_return()
    }
    
    [ section 5 ]
      this = that
    [ section 6 ]
      this_ = _that{hello() hhh = bbb}
    
    TOC{}
    
    GDC 7
    {
        Message = 6
        Message = 7
            Message = 8
            Message = 8
        RepeatCount = 2
        ErrorMessage = 10
        ErrorMessage = 11
        onKey[5] = 6
        onKey[6] = 4
        onKey[9] = 11
    }
    ';
    
    
    use re 'eval';
    
    my $rx = qr/
    
    \s*
    ( \[ [^\S\n]* )?                     # Grp 1  optional ini section delimiter '['
    (?<Section> \w+ (?:[^\S\n]+ \w+)* )  # Grp 2  'Section'
    (?(1) [^\S\n]* \] |)                 # Condition, if we matched '[' then look for ']'
    \s* 
    
    (?<Body>                   # Grp 3 'Body' (for display only)
       (?(1)| \{ )                   # Condition, if we're not a ini section then look for '{'
    
       (?{ print "Section: '$+{Section}'\n" })  # SECTION debug print, remove in production
    
       (?:                           # _grp_
           \s*                           # whitespace
           (?:                              # _grp_
                \/\* .*? \*\/                    # some comments
              |                               # OR ..
                                                 # Grp 4 'Key'  (tested with print, Perl doesn't have named capture arrays)
                (?<Key> \w[\w\[\]]* (?:[^\S\n]+ [\w\[\]]+)* )
                [^\S\n]* = [^\S\n]*              # =
                (?<Value> [^\n]* )               # Grp 5 'Value' (tested with print)
    
                (?{ print "  k\/v: '$+{Key}' = '$+{Value}'\n" })  # KEY,VALUE debug print, remove in production
              |                               # OR ..
                (?(1)| [^{}\n]* )                # any chars except newline and [{}] on the condition we're not a ini section
            )                               # _grpend_
            \s*                          # whitespace
        )*                           # _grpend_  do 0 or more times 
       (?(1)| \} )                   # Condition, if we're not a ini section then look for '}'
    )
    /x;
    
    
    while ($str =~ /$rx/xsg)
    {
        print "\n";
        print "Body:\n'$+{Body}'\n";
        print "=========================================\n";
    }
    
    __END__
    

    Output

    Section: 'Start 0'
      k/v: 'Next' = '1'
      k/v: 'Author' = 'rk'
      k/v: 'Date' = '2011-03-10'
    
    Body:
    '{
        Next = 1
        Author = rk
        Date = 2011-03-10
     /*  Description = simple
     */
    }'
    =========================================
    Section: 'PZ 11'
    
    Body:
    '{
    IA_return()
    }'
    =========================================
    Section: 'section 5'
      k/v: 'this' = 'that'
    
    Body:
    'this = that
    '
    =========================================
    Section: 'section 6'
      k/v: 'this_' = '_that{hello() hhh = bbb}'
    
    Body:
    'this_ = _that{hello() hhh = bbb}
    
    '
    =========================================
    Section: 'TOC'
    
    Body:
    '{}'
    =========================================
    Section: 'GDC 7'
      k/v: 'Message' = '6'
      k/v: 'Message' = '7'
      k/v: 'Message' = '8'
      k/v: 'Message' = '8'
      k/v: 'RepeatCount' = '2'
      k/v: 'ErrorMessage' = '10'
      k/v: 'ErrorMessage' = '11'
      k/v: 'onKey[5]' = '6'
      k/v: 'onKey[6]' = '4'
      k/v: 'onKey[9]' = '11'
    
    Body:
    '{
        Message = 6
        Message = 7
            Message = 8
            Message = 8
        RepeatCount = 2
        ErrorMessage = 10
        ErrorMessage = 11
        onKey[5] = 6
        onKey[6] = 4
        onKey[9] = 11
    }'
    =========================================