Question

I'm trying to figure out a clean way to split this text on a variable separator string. I'm close, but the match isn't grabbing the content after the newline. No variation of r^@(pv|rv|dv|ex) seems to do the trick.

re.DOTALL|re.MULTILINE

I'm close but it's not correctly capturing the multi-line components.

Answer 1

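Perhaps a better approach is to:

Use re.S so that . matches newlines
Use re.M so that ^ anchors at the start of each line
Use the non-greedy .*?
Use a lookahead so the next separator ends the match without being consumed

It looks like this: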

^@(pv|rv|dv|ex)@\s+(.*?)(?=^@(?:pv|rv|dv|ex)@|\Z)


Python demo:

txt="""@ex@ 70526 1422983747
@rv@ 1 @db.counters@ @change@ @1010@
@pv@ 0 @db.desc@ 1010 @PM recovery log.9
@
@pv@ 2 @db.change@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9
@ @@
@pv@ 2 @db.changex@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9
@ @@
@rv@ 10 @db.working@ @//192_168_1_20~ac~70520/pgtext/log.9@ @//depot/fyz/pgtext/log.9@ @192_168_1_20~ac~70520@ @bozo@ 0 1 0 160 0 1010 0 0 00000000000000000000000000000000 -1 0 0 160 @@ 0
@ex@ 70526 1422983747"""

import re

SEPS = ['pv', 'rv', 'dv', 'ex']

sep_st = '|'.join(SEPS)

# {0} is reused, so the separator alternation only needs to be passed once
pat = re.compile(r'^@({0})@\s+(.*?)(?=^@(?:{0})@|\Z)'.format(sep_st), re.S | re.M)

>>> print('\n'.join(repr(e) for e in pat.findall(txt)))
('ex', '70526 1422983747\n')
('rv', '1 @db.counters@ @change@ @1010@\n')
('pv', '0 @db.desc@ 1010 @PM recovery log.9\n@\n')
('pv', '2 @db.change@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n')
('pv', '2 @db.changex@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n')
('rv', '10 @db.working@ @//192_168_1_20~ac~70520/pgtext/log.9@ @//depot/fyz/pgtext/log.9@ @192_168_1_20~ac~70520@ @bozo@ 0 1 0 160 0 1010 0 0 00000000000000000000000000000000 -1 0 0 160 @@ 0\n')
('ex', '70526 1422983747')
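
To see why the non-greedy quantifier matters, here is a minimal sketch that runs the same pattern twice on a shortened sample, once with .*? and once with a greedy .*. The sample text and the names sample, lazy_pat, and greedy_pat are made up for illustration and are not part of the original answer.

```python
import re

# Shortened, made-up sample in the same shape as the original data
sample = "@ex@ 1 2\n@pv@ 0 @desc@ @line one\n@\n@rv@ 9"

seps = "pv|rv|dv|ex"

# Non-greedy body: stops at the first point where the lookahead succeeds
lazy_pat = re.compile(r'^@({0})@\s+(.*?)(?=^@(?:{0})@|\Z)'.format(seps), re.S | re.M)

# Greedy body: with re.S, .* runs to the end of the text and \Z is satisfied there
greedy_pat = re.compile(r'^@({0})@\s+(.*)(?=^@(?:{0})@|\Z)'.format(seps), re.S | re.M)

lazy_records = lazy_pat.findall(sample)
greedy_records = greedy_pat.findall(sample)

print(lazy_records)    # one tuple per record
print(greedy_records)  # a single tuple that swallows the later records
```

With the greedy version the first record absorbs everything up to the end of the input, which is why the lookahead alone is not enough.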

Answer 2

The simplest way is to use a negative lookahead assertion to match everything up to the next variable separator.

 # (?m)^(@(pv|rv|dv|ex)@(?:(?!@(?:pv|rv|dv|ex)@)[\S\s])*)

 (?m)                          # Multi-line mode
 ^                             # BOL 
 (                             # (1 start), Whole record
      @                             # Delimiter start
      ( pv | rv | dv | ex )         # (2), Delimiter contents
      @                             # Delimiter end
      (?:                           # Match anything up until the next delimiter
           (?!                           # ASSERT - not object delimiter
                @
                (?: pv | rv | dv | ex )
                @
           )
           [\S\s]                        # Grab this (any) character
      )*                            # Do zero to many times (until next delimiter)
 )                             # (1 end), Record
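
As a rough sketch of how this first pattern could be used from Python, the expanded form maps onto re.VERBOSE almost directly. The sample text and the names sample, record_re, and records are illustrative assumptions, not part of the original answer.

```python
import re

# Shortened, made-up sample in the same shape as the original data
sample = "@ex@ 1 2\n@pv@ 0 @desc@ @line one\n@\n@rv@ 9"

record_re = re.compile(r"""
    ^                                   # start of a line (needs re.M)
    (                                   # group 1: the whole record
        @ (pv|rv|dv|ex) @               # group 2: the separator name
        (?:
            (?! @ (?:pv|rv|dv|ex) @ )   # do not step onto the next separator
            [\S\s]                      # otherwise take any character, newlines included
        )*
    )
""", re.M | re.X)

# Pair each separator name with the full text of its record
records = [(m.group(2), m.group(1)) for m in record_re.finditer(sample)]
print(records)
```

Note that the lookahead here is not anchored to the start of a line, so a separator-like token in the middle of a line would also end the record.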

If you also need to validate that the records are well formed, you can do that too.

 # (?m)^(@(pv|rv|dv|ex)@(?:[^@]*(?:@(?!(?:pv|rv|dv|ex)@)[^@]*@)?)*)

 (?m)                          # Multi-line mode
 ^                             # BOL 
 (                             # (1 start), Whole record
      @                             # Delimiter start
      ( pv | rv | dv | ex )         # (2), Object delimiter contents
      @                             # Delimiter end
      (?:                           # Match anything up until the next delimiter
           [^@]*                         # Zero or more Not @
           (?:                           # Optional non object delimiter
                @                             # Delimiter start
                (?!                           # ASSERT - not object delimiter content
                     (?: pv | rv | dv | ex )
                     @ 
                )
                [^@]*                         # Zero or more Not @
                @                             # Delimiter end
           )?
      )*                            # Do zero to many times (until next delimiter)
 )                             # (1 end), Record
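
The stricter pattern can be tried the same way. This is only a sketch on a made-up sample whose @...@ fields are all properly paired; the names sample, strict_re, and records are assumptions for illustration. On input with an unpaired @ the pattern may cut a record at an unexpected place, so it is worth testing against real data.

```python
import re

# Shortened, made-up sample; every @...@ field is closed, as in the original data
sample = "@ex@ 1 2\n@pv@ 0 @desc@ @line one\n@\n@rv@ 9"

strict_re = re.compile(r"""
    ^                                   # start of a line (needs re.M)
    (                                   # group 1: the whole record
        @ (pv|rv|dv|ex) @               # group 2: the separator name
        (?:
            [^@]*                       # plain text without @
            (?:                         # optionally one @...@ field
                @ (?! (?:pv|rv|dv|ex) @ )   # opening @ that does not start a separator
                [^@]* @                     # field contents and closing @
            )?
        )*
    )
""", re.M | re.X)

records = [(m.group(2), m.group(1)) for m in strict_re.finditer(sample)]
print(records)
```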

Answer 3

>>> seps_re = re.compile(r"@(%s)@" % "|".join(SEPS))
>>> seps_re.split(s)

That gets as close as I could manage, and it is pretty close. I think you can take it and make it work with something like this:

>>> seps_re = re.compile(r"@(%s)@" % "|".join(SEPS))
>>> my_items = [item for item in seps_re.split(s) if item]  # drop empty strings
>>> my_pairs = list(zip(*[iter(my_items)] * 2))  # pair each separator with its text
>>> print(my_pairs)
[('ex', ' 70526 1422983747\n'), ('rv', ' 1 @db.counters@ @change@ @1010@\n'), ('pv', ' 0 @db.desc@ 1010 @PM recovery log.9\n@\n'), ('pv', ' 2 @db.change@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n'), ('pv', ' 2 @db.changex@ 1010 1010 @192_168_1_20~ac~70520@ @bozo@ 1422983747 0 @PM recovery log.9\n@ @@\n'), ('rv', ' 10 @db.working@ @//192_168_1_20~ac~70520/pgtext/log.9@ @//depot/fyz/pgtext/log.9@ @192_168_1_20~ac~70520@ @bozo@ 0 1 0 160 0 1010 0 0 00000000000000000000000000000000 -1 0 0 160 @@ 0\n'), ('ex', ' 70526 1422983747')]
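
A possible variation on the split approach, sketched under two assumptions that go beyond the original answer: the separator is anchored to the start of a line with ^ and re.M, and the pairs are built by slicing rather than by filtering out empty strings. The sample text and the names sample, parts, and pairs are made up for illustration.

```python
import re

SEPS = ['pv', 'rv', 'dv', 'ex']

# Shortened, made-up sample in the same shape as the original data
sample = "@ex@ 1 2\n@pv@ 0 @desc@ @line one\n@\n@rv@ 9"

# Assumption: only separators at the start of a line should split the text
seps_re = re.compile(r"^@(%s)@" % "|".join(SEPS), re.M)

# Because the pattern has one capturing group, split() returns
# [text before first separator, name, body, name, body, ...]
parts = seps_re.split(sample)

# Odd indexes hold separator names, the following even indexes hold their bodies
pairs = list(zip(parts[1::2], parts[2::2]))
print(pairs)
```

Slicing keeps a record whose body happens to be empty, whereas dropping empty strings first would shift the name/body pairing in that case.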

Greedy split across multiline
