Greedy split across multiline

Date: 2015-05-12 23:29:20

Tags: python regex

I'm trying to figure out a clean way to split this text on a variable separator string. I'm close, but the pattern isn't grabbing the content after the newline. No amount of r^@(pv|rv|dv|ex) seems to do the trick.

re.DOTALL|re.MULTILINE

I'm close but it's not correctly capturing the multi-line components.

3 answers:

Answer 0 (score: 1)

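Perhaps a better approach is:

  1. Use re.S so that . matches newlines
  2. Use re.M so that ^ anchors at the start of each line
  3. Use the non-greedy .*?
  4. Use a lookahead so each match stops before the next separator without consuming it
  5. It looks like this: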

    ^@(pv|rv|dv|ex)@\s+(.*?)(?=^@(?:pv|rv|dv|ex)@|\Z)
    

    Python demo:

    txt="""@ex@ 70526 1422983747
    @rv@ 1 @db.counters@ @change@ @1010@
    @pv@ 0 @db.desc@ 1010 @PM recovery log.9
    @
    @pv@ 2 @db.change@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9
    @ @@
    @pv@ 2 @db.changex@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9
    @ @@
    @rv@ 10 @db.working@ @//192_168_1_20~ac~70520/pgtext/log.9@ @//depot/fyz/pgtext/log.9@ @192_168_1_20~ac~70520@ @bozo@ 0 1 0 160 0 1010 0 0 00000000000000000000000000000000 -1 0 0 160 @@ 0
    @ex@ 70526 1422983747"""
    
    import re
    
    SEPS = ['pv', 'rv', 'dv', 'ex']
    
    sep_st='|'.join(SEPS)
    
    pat=re.compile(r'^@({})@\s+(.*?)(?=^@(?:{})@|\Z)'.format(sep_st, sep_st), re.S | re.M)
    
    >>> print('\n'.join(repr(e) for e in pat.findall(txt)))
    ('ex', '70526 1422983747\n')
    ('rv', '1 @db.counters@ @change@ @1010@\n')
    ('pv', '0 @db.desc@ 1010 @PM recovery log.9\n@\n')
    ('pv', '2 @db.change@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n')
    ('pv', '2 @db.changex@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n')
    ('rv', '10 @db.working@ @//192_168_1_20~ac~70520/pgtext/log.9@ @//depot/fyz/pgtext/log.9@ @192_168_1_20~ac~70520@ @bozo@ 0 1 0 160 0 1010 0 0 00000000000000000000000000000000 -1 0 0 160 @@ 0\n')
    ('ex', '70526 1422983747')
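
To see why steps 1 to 3 matter, here is a small sketch that runs the same pattern three ways against a trimmed copy of the sample data. The names sample, template, greedy, lazy, and no_dotall are illustrative only and are not part of the original answer.

```python
import re

# Trimmed copy of the sample data; the 'pv' record spans two lines.
sample = (
    "@ex@ 70526 1422983747\n"
    "@pv@ 0 @db.desc@ 1010 @PM recovery log.9\n"
    "@\n"
    "@rv@ 1 @db.counters@ @change@ @1010@"
)

# The {} slot receives the body quantifier so the variants can be compared.
template = r'^@(pv|rv|dv|ex)@\s+({})(?=^@(?:pv|rv|dv|ex)@|\Z)'

greedy = re.compile(template.format('.*'), re.S | re.M)
lazy = re.compile(template.format('.*?'), re.S | re.M)
no_dotall = re.compile(template.format('.*?'), re.M)

# Greedy .* with re.S runs to the end of the text: one oversized record.
print(len(greedy.findall(sample)))     # 1

# Lazy .*? stops at the first line that starts a new separator.
print(len(lazy.findall(sample)))       # 3

# Without re.S, . cannot cross a newline, so only the last record
# (which ends at \Z) can match.
print(no_dotall.findall(sample))       # [('rv', '1 @db.counters@ @change@ @1010@')]
```

The lookahead is what keeps the next separator available for the following match, since a lookahead checks the text without consuming it.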
    

Answer 1 (score: 1)

The simplest way is to use an assertion and match everything up to the next variable separator.

 # (?m)^(@(pv|rv|dv|ex)@(?:(?!@(?:pv|rv|dv|ex)@)[\S\s])*)

 (?m)                          # Multi-line mode
 ^                             # BOL 
 (                             # (1 start), Whole record
      @                             # Delimiter start
      ( pv | rv | dv | ex )         # (2), Delimiter contents
      @                             # Delimiter end
      (?:                           # Match anything up until the next delimiter
           (?!                           # ASSERT - not object delimiter
                @
                (?: pv | rv | dv | ex )
                @
           )
           [\S\s]                        # Grab this (any) character
      )*                            # Do zero to many times (until next delimiter)
 )                             # (1 end), Record
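
As a rough sketch, the expanded pattern above can be used from Python more or less as written by compiling it with re.X (verbose mode). The sample text is a trimmed copy of the data from the first answer, and the names sample, record_re, and records are illustrative. The multi-line flag is passed as re.M instead of the inline (?m).

```python
import re

# Trimmed copy of the sample data; the 'pv' record spans two lines.
sample = (
    "@ex@ 70526 1422983747\n"
    "@pv@ 0 @db.desc@ 1010 @PM recovery log.9\n"
    "@\n"
    "@rv@ 1 @db.counters@ @change@ @1010@"
)

record_re = re.compile(r"""
    ^                                   # beginning of a line (re.M)
    (                                   # (1) whole record
        @ (pv|rv|dv|ex) @               # (2) delimiter contents
        (?:
            (?! @ (?:pv|rv|dv|ex) @ )   # assert: not at the next delimiter
            [\S\s]                      # take this (any) character
        )*
    )
""", re.M | re.X)

# Collect (delimiter, whole record) pairs.
records = [(m.group(2), m.group(1)) for m in record_re.finditer(sample)]
for kind, text in records:
    print(kind, repr(text))
```

Note that the negative lookahead is not anchored to the start of a line, so a separator such as @pv@ appearing in the middle of a line would also end the current record.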

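If you need to validate that the records are well formed (every non-separator @ is paired with a closing @), that can be done too.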

 # (?m)^(@(pv|rv|dv|ex)@(?:[^@]*(?:@(?!(?:pv|rv|dv|ex)@)[^@]*@)?)*)

 (?m)                          # Multi-line mode
 ^                             # BOL 
 (                             # (1 start), Whole record
      @                             # Delimiter start
      ( pv | rv | dv | ex )         # (2), Object delimiter contents
      @                             # Delimiter end
      (?:                           # Match anything up until the next delimiter
           [^@]*                         # Zero or more Not @
           (?:                           # Optional non object delimiter
                @                             # Delimiter start
                (?!                           # ASSERT - not object delimiter content
                     (?: pv | rv | dv | ex )
                     @ 
                )
                [^@]*                         # Zero or more Not @
                @                             # Delimiter end
           )?
      )*                            # Do zero to many times (until next delimiter)
 )                             # (1 end), Record
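
A possible way to try the validating pattern from Python is sketched below. The helper well_formed is an assumption of this sketch, not part of the answer: it treats the input as valid only when the matched records, joined back together, reproduce the whole text. The names sample, bad, and strict_re are likewise illustrative.

```python
import re

strict_re = re.compile(r"""
    ^                                       # beginning of a line (re.M)
    (                                       # (1) whole record
        @ (pv|rv|dv|ex) @                   # (2) object delimiter contents
        (?:
            [^@]*                           # zero or more non-@ characters
            (?:                             # optional paired @...@ field
                @ (?! (?:pv|rv|dv|ex) @ )   # opening @, not an object delimiter
                [^@]* @                     # field body and closing @
            )?
        )*
    )
""", re.M | re.X)


def well_formed(text):
    """Hypothetical check: the matched records must cover the entire text."""
    return "".join(m.group(1) for m in strict_re.finditer(text)) == text


sample = (
    "@ex@ 70526 1422983747\n"
    "@pv@ 0 @db.desc@ 1010 @PM recovery log.9\n"
    "@\n"
    "@rv@ 1 @db.counters@ @change@ @1010@"
)
bad = "@pv@ 0 @unclosed\n@ex@ 1"   # the field opened by @unclosed is never closed

print(well_formed(sample))   # True
print(well_formed(bad))      # False
```

In the malformed case the unclosed field swallows the opening @ of the next separator, so the records no longer tile the input and the check fails.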

Answer 2 (score: 0)

>>> seps_re = re.compile(r"@(%s)@" % "|".join(SEPS))
>>> seps_re.split(s)

That is as close as I could get, and it is pretty close. I think you could take it and make it work with something like this:

>>> seps_re = re.compile(r"@(%s)@" % "|".join(SEPS))
>>> my_items = filter(None, seps_re.split(s))
>>> my_pairs = list(zip(*[iter(my_items)]*2))
>>> print(my_pairs)
[('ex', ' 70526 1422983747\n'), ('rv', ' 1 @db.counters@ @change@ @1010@\n'), ('pv', ' 0 @db.desc@ 1010 @PM recovery log.9\n@\n'), ('pv', ' 2 @db.change@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n'), ('pv', ' 2 @db.changex@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n'), ('rv', ' 10 @db.working@ @//192_168_1_20~ac~70520/pgtext/log.9@ @//depot/fyz/pgtext/log.9@ @192_168_1_20~ac~70520@ @bozo@ 0 1 0 160 0 1010 0 0 00000000000000000000000000000000 -1 0 0 160 @@ 0\n'), ('ex', ' 70526 1422983747')]
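
The pairing step relies on two details: re.split keeps the text of a capturing group in its result, and zipping one iterator with itself consumes items two at a time. A minimal sketch of both, using a trimmed sample and illustrative names (sample, parts, tokens, pairs, safer_pairs):

```python
import re

SEPS = ['pv', 'rv', 'dv', 'ex']
seps_re = re.compile(r"@(%s)@" % "|".join(SEPS))

sample = "@ex@ 70526 1422983747\n@pv@ 0 @db.desc@ 1010 @PM recovery log.9\n@\n"

# Because the pattern has a capturing group, split() returns the
# separators as well: ['', 'ex', ' 7052...', 'pv', ' 0 @db...'].
parts = seps_re.split(sample)

# Drop empty strings, then walk one shared iterator two items at a time.
tokens = [p for p in parts if p]
it = iter(tokens)
pairs = list(zip(it, it))
print(pairs)

# Slicing pairs separators with bodies by position instead, which stays
# aligned even when a record body is empty.
safer_pairs = list(zip(parts[1::2], parts[2::2]))
print(safer_pairs == pairs)   # True for this sample
```

Filtering out empty strings would misalign the pairs if a separator were ever followed directly by another separator, which is why the slicing variant may be the safer choice. Unlike the other answers, this pattern is not anchored to the start of a line.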