Using multiple regex matches to "count" in Python

Asked: 2011-02-15 23:09:12

Tags: python regex

Suppose I have the following multiline string:

# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection

I would like it to become:

# 1 Section
## 1.1 Subsection
## 1.2 Subsection
# 2 Section
## 2.1 Subsection
### 2.1.1 Subsubsection
### 2.1.2 Subsubsection
# 3 Section
## 3.1 Subsection

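In Python, using the re module, is it possible to run a substitution over the string that would:

  • match the beginning of each line based on the number of #'s
  • keep track of past matches of #-groups with the same count
  • insert this counter into the line where appropriate

...assuming any of these 'counters' are always non-zero?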

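This question is testing the limits of my regex knowledge. I already know that I could iterate over the lines and increment/insert some variables, so that is not the solution I'm after. I'm just curious whether this kind of functionality exists within regular expressions alone, since I know that some kind of counting already exists (e.g. the number of substitutions to make).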
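For reference, the "counting" that re already offers is limited to capping or reporting the number of substitutions. A minimal sketch (the sample text and the `*` replacement are just illustrative choices):

```python
import re

text = "# Section\n## Subsection\n## Subsection\n"

# count=1 caps how many substitutions are made; it does not number them.
first_only = re.sub(r'^#+', '*', text, count=1, flags=re.MULTILINE)

# subn() reports how many substitutions happened, as one total for the call.
replaced, total = re.subn(r'^#+', '*', text, flags=re.MULTILINE)

print(first_only)
print(total)
```

Neither gives a per-match counter inside the pattern itself, which is why the answers below keep the state in Python code.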

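6 answers:

Answer 0 (score: 3)

«OK, sure, but what if the 'variable manipulation' is done in a callback function passed to re.sub? Would that work? I suppose a simplified form of my question is: "Can regular expressions be used to make different substitutions depending on previous matches?"»

It sounds like we need a generator function as the callback; unfortunately, re.sub() does not accept a generator function as a callback.

So we have to use a trick: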

import re

pat = re.compile('^(#+)',re.MULTILINE)

ch = '''# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
## Subsection
### Subsubsection
### Subsubsection
#### Sub4section
#### Sub4section
#### Sub4section
#### Sub4section
##### Sub5section
#### Sub4section
##### Sub5section
##### Sub5section
### Subsubsection
### Subsubsection
#### Sub4section
#### Sub4section
## Subsection
### Subsubsection
### Subsubsection
### Subsubsection
#### Sub4section
##### Sub5section
##### Sub5section
### Subsubsection
#### Sub4section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
#### Sub4section
#### Sub4section
#### Sub4section
##### Sub5section
#### Sub4section
### Subsubsection
## Subsection
### Subsubsection
# Section
## Subsection
'''

def cbk(match, nb = [0] ):
    if len(match.group())==len(nb):
        nb[-1] += 1
    elif  len(match.group())>len(nb):
        nb.append(1)
    else:
        nb[:] = nb[0:len(match.group())]
        nb[-1] += 1
    return match.group()+' '+('.'.join(map(str,nb)))

ch = pat.sub(cbk,ch)
print(ch)

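«Default parameter values are evaluated when the function definition is executed. This means that the expression is evaluated once, when the function is defined, and that the same "pre-computed" value is used for each call. This is especially important to understand when a default parameter is a mutable object, such as a list or a dictionary: if the function modifies the object (e.g. by appending an item to a list), the default value is in effect modified. This is generally not what was intended.»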

http://docs.python.org/reference/compound_stmts.html#function

But here, that is exactly my intention.

Result:

# 1 Section
## 1.1 Subsection
## 1.2 Subsection
# 2 Section
## 2.1 Subsection
### 2.1.1 Subsubsection
### 2.1.2 Subsubsection
## 2.2 Subsection
### 2.2.1 Subsubsection
### 2.2.2 Subsubsection
#### 2.2.2.1 Sub4section
#### 2.2.2.2 Sub4section
#### 2.2.2.3 Sub4section
#### 2.2.2.4 Sub4section
##### 2.2.2.4.1 Sub5section
#### 2.2.2.5 Sub4section
##### 2.2.2.5.1 Sub5section
##### 2.2.2.5.2 Sub5section
### 2.2.3 Subsubsection
### 2.2.4 Subsubsection
#### 2.2.4.1 Sub4section
#### 2.2.4.2 Sub4section
## 2.3 Subsection
### 2.3.1 Subsubsection
### 2.3.2 Subsubsection
### 2.3.3 Subsubsection
#### 2.3.3.1 Sub4section
##### 2.3.3.1.1 Sub5section
##### 2.3.3.1.2 Sub5section
### 2.3.4 Subsubsection
#### 2.3.4.1 Sub4section
## 2.4 Subsection
### 2.4.1 Subsubsection
### 2.4.2 Subsubsection
# 3 Section
## 3.1 Subsection
## 3.2 Subsection
# 4 Section
## 4.1 Subsection
### 4.1.1 Subsubsection
#### 4.1.1.1 Sub4section
#### 4.1.1.2 Sub4section
#### 4.1.1.3 Sub4section
##### 4.1.1.3.1 Sub5section
#### 4.1.1.4 Sub4section
### 4.1.2 Subsubsection
## 4.2 Subsection
### 4.2.1 Subsubsection
# 5 Section
## 5.1 Subsection
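One consequence of keeping the counters in the default argument `nb = [0]`: the list lives as long as `cbk` does, so a second `pat.sub(cbk, ...)` in the same process continues numbering from where the first run stopped. If that is not wanted, a closure that builds a fresh callback per run is one option. This is only a sketch; `make_cbk` and `number` are names made up for the illustration, and the branch logic is the condensed form of `cbk`:

```python
import re

pat = re.compile(r'^(#+)', re.MULTILINE)

def make_cbk():
    """Return a new callback with its own private counter list."""
    nb = [0]  # fresh state for each callback, not shared between runs

    def cbk(match):
        depth = len(match.group())
        if depth > len(nb):
            nb.append(1)          # going one level deeper
        else:
            nb[:] = nb[:depth]    # drop counters of deeper levels
            nb[-1] += 1           # next heading at this level
        return match.group() + ' ' + '.'.join(map(str, nb))

    return cbk

def number(text):
    # A new callback per call, so numbering always restarts at 1.
    return pat.sub(make_cbk(), text)

print(number("# Section\n## Subsection\n## Subsection\n# Section"))
```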

Edit 1: I corrected `else nb[:] = nb[0:len(match.group())]` to `else:`.

Edit 2: the code can be condensed to

def cbk(match, nb = [0] ):
    if len(match.group())>len(nb):
        nb.append(1)
    else:
        nb[:] = nb[0:len(match.group())]
        nb[-1] += 1
    return match.group()+' '+('.'.join(map(str,nb))) 
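On the remark at the top of this answer that re.sub() will not take a generator function as its callback: passing the generator function itself fails because calling it returns a generator object rather than a string. The bound `send` method of an already-primed generator is an ordinary one-argument callable, though, so it can be handed to re.sub(). A rough sketch, where `numbering` and `number_headings` are hypothetical names, and skipped levels are filled with 1s (slightly different from `cbk`, which appends a single 1):

```python
import re

def numbering():
    """Generator used as a coroutine: receives a match, yields its replacement."""
    counters = []
    replacement = None
    while True:
        match = yield replacement
        level = len(match.group())
        if level > len(counters):
            # deeper than before: start new counters at 1
            counters.extend([1] * (level - len(counters)))
        else:
            # same or shallower level: drop deeper counters, bump this one
            del counters[level:]
            counters[-1] += 1
        replacement = match.group() + ' ' + '.'.join(map(str, counters))

def number_headings(text):
    gen = numbering()
    next(gen)  # prime the generator so it is paused at the first yield
    return re.sub(r'^#+', gen.send, text, flags=re.MULTILINE)

print(number_headings("# Section\n## Subsection\n## Subsection\n# Section"))
```

Each call to `number_headings` creates a new generator, so the counters do not leak between runs.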

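Answer 1 (score: 1)

Regular expressions are for matching strings. They are not meant for manipulating variables as matches occur. You may not like the solution of iterating over each line and keeping count yourself, but it is a simple solution.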
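For completeness, the line-by-line approach could look roughly like this. It is a sketch without any regex; `number_lines` is a made-up name, and it assumes headings start with '#' in the first column:

```python
def number_lines(text):
    counters = []   # one counter per heading level currently open
    out = []
    for line in text.split('\n'):
        rest = line.lstrip('#')
        level = len(line) - len(rest)   # number of leading '#' characters
        if level == 0:
            out.append(line)            # not a heading, keep as-is
            continue
        if level > len(counters):
            counters.extend([1] * (level - len(counters)))
        else:
            del counters[level:]
            counters[-1] += 1
        out.append('#' * level + ' ' + '.'.join(map(str, counters)) + rest)
    return '\n'.join(out)

print(number_lines("# Section\n## Subsection\n## Subsection\n# Section"))
```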

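Answer 2 (score: 1)

Pyparsing packages several of these scan/match/replace tasks into its own parsing framework. Here is an annotated solution to the problem as you stated it: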

from pyparsing import LineStart, Word, restOfLine

source = """\
# Section 
## Subsection 
## Subsection 
# Section 
## Subsection #
### Subsubsection 
### Subsubsection 
# Section 
## Subsection 
"""

# define a pyparsing expression to match a header line starting with some 
# number of '#'s (i.e., a "word" composed of '#'s), followed by the rest 
# of the line
sectionHeader = LineStart() + Word("#")("level") + restOfLine

# define a callback to keep track of the nesting and numbering
numberstack = [0]
def insertDottedNumber(tokens):
    level = len(tokens.level)
    if level > len(numberstack):
        numberstack.extend([1]*(level-len(numberstack)))
    else:
        del numberstack[level:]
        numberstack[level-1] += 1

    dottedNum = '.'.join(map(str,numberstack))

    # return the updated string containing the original level and rest
    # of the line, with the dotted number inserted
    return "%s %s %s" % (tokens.level, dottedNum, tokens[1])

# attach parse-time action callback to the sectionHeader expression
sectionHeader.setParseAction(insertDottedNumber)

# use sectionHeader expression to transform the input source string
newsource = sectionHeader.transformString(source)
print(newsource)

Prints the desired output:

# 1  Section 
## 1.1  Subsection 
## 1.2  Subsection 
# 2  Section 
## 2.1  Subsection #
### 2.1.1  Subsubsection 
### 2.1.2  Subsubsection 
# 3  Section 
## 3.1  Subsection 

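Answer 3 (score: 0)

This is not a job for regular expressions alone, but you can use them to simplify the work. For example, this uses a regex to split the full text into its main sections: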

>>> p = re.compile(r"^# .*\n^(?:^##.*\n)*", re.M)
>>> p.findall(your_text)
['# Section\n## Subsection\n## Subsection\n', '# Section\n## Subsection\n### Subsubsection\n### Subsubsection\n', '# Section\n']

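You could imagine doing something recursive with regexes like this to split the subsections further, but you are better off just looping over the lines.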
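As a self-contained version of the splitting step, here is a small sketch. The sample `TEXT` and the helper name `split_sections` are assumptions for the illustration; the pattern is the one above with the redundant inner `^` anchors dropped. Note that every line, including the last, must end with a newline for it to be captured:

```python
import re

TEXT = (
    "# Section\n"
    "## Subsection\n"
    "## Subsection\n"
    "# Section\n"
    "## Subsection\n"
    "### Subsubsection\n"
    "# Section\n"
    "## Subsection\n"
)

# A top-level heading line followed by any number of deeper heading lines.
SECTION = re.compile(r"^# .*\n(?:^##.*\n)*", re.M)

def split_sections(text):
    return SECTION.findall(text)

for index, chunk in enumerate(split_sections(TEXT), start=1):
    print(index, repr(chunk))
```

The `##.*` part also matches `###` lines, since they begin with `##`, so deeper levels stay inside their parent chunk.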

Answer 4 (score: 0)

import re
import textwrap

class DefaultList(list):
    """
    List having a default value (returned on invalid offset)

    >>> t = DefaultList([1,2,3], default=17)
    >>> t[104]
    17
    """
    def __init__(self, *args, **kwargs):
        self.default = kwargs.pop('default', None)
        super(DefaultList,self).__init__(*args, **kwargs)

    def __getitem__(self, y):
        if y >= self.__len__():
            return self.default
        else:
            return super(DefaultList,self).__getitem__(y)

class SectionNumberer(object):
    "Hierarchical document numberer"
    def __init__(self, LineMatcher, Numbertype_list, defaultNumbertype):
        """
        @param LineMatcher:       line matcher instance  (recognize section headings and parse them)
        @param Numbertype_list:   list of Number classes (do section numbering at each level)
        @param defaultNumbertype: default Number class   (if too few Number classes specified)
        """
        super(SectionNumberer,self).__init__()
        self.match   = LineMatcher
        self.types   = DefaultList(Numbertype_list, default=defaultNumbertype)
        self.numbers = []
        self.title   = ''

    def addSection(self, level, title):
        "Add new section"
        depth = len(self.numbers)
        if depth < level:
            for i in range(depth, level):
                self.numbers.append(self.types[i](1))
        else:
            self.numbers = self.numbers[:level]
            self.numbers[-1].inc()

        self.title = title

    def doLine(self, ln):
        "Process section numbering on single-line string"
        match = self.match(ln)
        if match==False:
            return ln
        else:
            self.addSection(*match)
            return str(self)

    def __call__(self, s):
        "Process section numbering on multiline string"
        return '\n'.join(self.doLine(ln) for ln in s.split('\n'))

    def __str__(self):
        "Get label for current section"
        section = '.'.join(str(n) for n in self.numbers)
        return "{0} {1}".format(section, self.title)

class LineMatcher(object):
    "Recognize section headers and parse them"
    def __init__(self, match):
        super(LineMatcher,self).__init__()
        self.match = re.compile(match)

    def __call__(self, line):
        """
        @param line: string

        Expects that self.match is a valid regex expression
        """
        match = re.match(self.match, line)
        if match:
            return len(match.group(1)), match.group(2)
        else:
            return False

# Recognize section headers that look like '### Section_title'
PoundLineMatcher = lambda: LineMatcher(r'([#]+) (.*)')

class Numbertype(object):
    def __init__(self, startAt=0, valueType=int):
        super(Numbertype,self).__init__()
        self.value = valueType(startAt)

    def inc(self):
        self.value += 1

    def __str__(self):
        return str(self.value)

class Roman(int):
    CODING = [
        (1000, 'M'),
        ( 900, 'CM'), ( 500, 'D'), ( 400, 'CD'), ( 100, 'C'),
        (  90, 'XC'), (  50, 'L'), (  40, 'XL'), (  10, 'X'),
        (   9, 'IX'), (   5, 'V'), (   4, 'IV'), (   1, 'I')
    ]

    def __add__(self, y):
        return Roman(int.__add__(self, y))

    def __str__(self):
        value = self.__int__()
        if 0 < value < 4000:
            result = []
            for v,s in Roman.CODING:
                while v <= value:
                    value -= v
                    result.append(s)
            return ''.join(result)
        else:
            raise ValueError("can't generate Roman numeral string for {0}".format(value))

IntNumber = Numbertype
RomanNumber = lambda x=1: Numbertype(x, Roman)

def main():
    test = textwrap.dedent("""
        # Section
        ## Subsection
        ## Subsection
        # Section
        ## Subsection
        ### Subsubsection
        ### Subsubsection
        # Section
        ## Subsection
    """)

    numberer = SectionNumberer(PoundLineMatcher(), [IntNumber, RomanNumber, IntNumber], IntNumber)
    print(numberer(test))

if __name__=="__main__":
    main()

# Section
## Subsection
## Subsection
# Section
## Subsection
### Subsubsection
### Subsubsection
# Section
## Subsection

1 Section
1.I Subsection
1.II Subsection
2 Section
2.I Subsection
2.I.1 Subsubsection
2.I.2 Subsubsection
3 Section
3.I Subsection
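The Roman numerals in that output come from greedy subtraction over the `CODING` table. Pulled out on its own, the conversion might be sketched like this (`to_roman` is a name invented here; the table and the 1..3999 range are taken from the `Roman` class above):

```python
CODING = [
    (1000, 'M'),
    (900, 'CM'), (500, 'D'), (400, 'CD'), (100, 'C'),
    (90, 'XC'), (50, 'L'), (40, 'XL'), (10, 'X'),
    (9, 'IX'), (5, 'V'), (4, 'IV'), (1, 'I'),
]

def to_roman(value):
    """Convert an int in 1..3999 to a Roman numeral string."""
    if not 0 < value < 4000:
        raise ValueError("can't generate Roman numeral string for {0}".format(value))
    parts = []
    for amount, symbol in CODING:
        # take the largest symbol as often as it still fits
        while amount <= value:
            value -= amount
            parts.append(symbol)
    return ''.join(parts)

print(to_roman(2))
print(to_roman(1994))
```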

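Answer 5 (score: 0)

Go with that generator trick by eyquem. Failing that, you can always find all the matches in a global context and then rewrite the content into a new buffer.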
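In Python, the "find everything, then rewrite into a new buffer" idea could be sketched with re.finditer(). The names `HEADING` and `renumber` are hypothetical, and the list `buffer` stands in for the new buffer:

```python
import re

HEADING = re.compile(r'^#+', re.MULTILINE)

def renumber(text):
    counters = []
    buffer = []      # pieces of the rewritten text
    last_end = 0     # where the previous match ended in the original text
    for m in HEADING.finditer(text):
        level = len(m.group())
        if level > len(counters):
            counters.extend([1] * (level - len(counters)))
        else:
            del counters[level:]
            counters[-1] += 1
        buffer.append(text[last_end:m.end()])           # copy up to and including the '#'s
        buffer.append(' ' + '.'.join(map(str, counters)))
        last_end = m.end()
    buffer.append(text[last_end:])                      # copy the tail
    return ''.join(buffer)

print(renumber("intro\n# A\n## B\n# C\n"))
```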

If it is just a one-off thing, this Perl sample does all the work...

use strict;
use warnings;

my $data = '
 # 
 ## 
 ## 
 # 
 ## 
 ### 
 ### 
 ###### 
 ##### 
 ####  
 ##### 
 #### 
 ##### 
 ###### 
 ##### 
 ## 
 # 
 ## 
 ';

my @cnts = ();

$data =~ s/^ [^\S\n]* (\#+) [^\S\n]* (.*) $/ callback($1,$2) /xemg;

print $data;

exit(0);

##
 sub callback {
    my ($pounds, $text) = @_;
    my $i = length($pounds) - 1;
    if ($i == 0 || $i <= $#cnts) {
        @cnts[ ($i+1) .. $#cnts ] = (0) x ($#cnts - $i);
        ++$cnts[ $i ];
    }
    else {
        @cnts[ ($#cnts+1) .. $i ] = (1) x ($i - $#cnts);
    }
    my $chapter = $cnts[0];
    for my $ndx (1 .. $i) {
        $chapter .= ".$cnts[ $ndx]";
    }
    return "$pounds \t $chapter $text";
 }

Output:

#        1
##       1.1
##       1.2
#        2
##       2.1
###      2.1.1
###      2.1.2
######   2.1.2.1.1.1
#####    2.1.2.1.2
####     2.1.2.2
#####    2.1.2.2.1
####     2.1.2.3
#####    2.1.2.3.1
######   2.1.2.3.1.1
#####    2.1.2.3.2
##       2.2
#        3
##       3.1
