Regular expression repetition

Time: 2018-05-03 09:42:18

Tags: regex python-3.x

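I want to find section numbers with a regular expression. Given the following string, I want to match lines such as "3. Results" and "3.1. Result 2", without also picking up the "5." at the end of the first line.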

import re
MyStr = """ test 5.

3. Results

3.1. Result 2

3.3.1 Result

test test test test"""

print(repr(MyStr))
match = re.findall(r"(?:\d[ \t]*?).+?\n\n", MyStr, re.DOTALL|re.MULTILINE)
print(match)

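However, I cannot distinguish between "test 5." and "3.". Question: how do I tell the regex that the period must not be immediately followed by \n, so that only the first character after it matters? I tried adding [ \t] in several ways, without success. The regex should still be flexible enough to match any form of "3.".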

Thank you very much for your help.

斯泰恩

1 answer:

Answer 0: (score: 2)

I'm not sure what constraints your numbering scheme has. In any case, the following code works for your example:

import re

MyStr = """ test 5.

3. Results

3.1. Result 2

3.3.1 Result

test test test test"""

str_list = re.findall(r'^(?:\d+\.)+.*?$', MyStr, re.MULTILINE)
for s in str_list:
    print(s)
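
A minimal sketch of why the anchor helps, assuming the same sample string as above: without `^`, `\d` may match anywhere in a line, so the `5.` at the end of ` test 5.` is picked up. With `^` and `re.MULTILINE`, a match can only start at the beginning of a line. The names `unanchored` and `anchored` are only illustrative.

```python
import re

MyStr = """ test 5.

3. Results

3.1. Result 2

3.3.1 Result

test test test test"""

# Pattern from the question: \d is free to match in the middle of a line.
unanchored = re.findall(r"(?:\d[ \t]*?).+?\n\n", MyStr, re.DOTALL | re.MULTILINE)

# Anchored pattern: with re.MULTILINE, ^ matches only at the start of a line,
# and " test 5." starts with a space, not with a digit.
anchored = re.findall(r"^(?:\d+\.)+.*?$", MyStr, re.MULTILINE)

print(repr(unanchored[0]))  # first hit of the unanchored pattern
print(anchored)             # hits of the anchored pattern
```

Note that the anchored pattern relies on the line starting directly with the number. If numbered lines may be indented, something like `^[ \t]*` in front would be needed, which is an assumption beyond the example given.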

Here is an improved version that handles all the cases in the modified table of contents below.

import re

MyStr = """Table of Contents ...

1. 1st title

20. 1-line title

300. 2-lines title ...
   ... continued here

300.1. 1-line subtitle

300.2. 2-lines subtitle ...
   ... continued here

300.3.1 title, not followed by a blank line
300.3.20 next title omitted and no trailing period
300.3.31
300.3.45 next title omitted and trailing period
300.3.56.

4000. last title

999 Lorem ipsum dolor sit amet, consectetur adipisici elit,
sed eiusmod tempor incidunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquid ex ea commodi consequat.

Quis aute iure reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur.
Excepteur sint obcaecat cupiditat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum. 

... followed by arbitrary text, which must not start with (a) digit(s) followed by a period"""

str_list = re.findall(r'''
                           ^       # start of a line (re.MULTILINE)
                           (?:     # non-capturing group ...
                               \d+     # 1 or more decimal digits
                               \.      # literal period
                           )+      # ... repeated 1 or more times
                           .*?     # as few characters as possible; with re.DOTALL this includes newlines
                           $       # end of a line
                           ^       # start of a line: "$" directly followed by "^" only fits an empty line
                           .*?     # as few characters as possible (empty on that empty line)
                           $       # end of that line
                     ''', MyStr, re.MULTILINE | re.DOTALL | re.VERBOSE)
for s in str_list:
    print(s, end='')
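
To make the behaviour of the `$` `^` pair easier to see, here is a small sketch on a shortened sample. The string `sample` and the names `pattern` and `blocks` are made up for this illustration, and the pattern is the same one as above, written on a single line. Each match runs from a numbered line up to the next empty line, so multi-line titles and runs of titles without blank lines in between come back as one block.

```python
import re

# Hypothetical, shortened sample (not the full table of contents above).
sample = (
    "Table of Contents ...\n"
    "\n"
    "1. 1st title\n"
    "\n"
    "300. 2-lines title ...\n"
    "   ... continued here\n"
    "\n"
    "300.3.1 title\n"
    "300.3.56.\n"
    "\n"
    "999 Lorem ipsum\n"
)

# Same pattern as the verbose version, without the comments.
pattern = re.compile(r"^(?:\d+\.)+.*?$^.*?$", re.MULTILINE | re.DOTALL)

blocks = pattern.findall(sample)
for block in blocks:
    print(repr(block))

# A numbered line that is followed by neither a newline nor an empty line
# is not matched, because "$" and "^" never coincide.
print(pattern.findall("1. title"))
```

One consequence worth keeping in mind: the last numbered block is only found if it is followed by an empty line or at least a trailing newline. The line starting with `999` is skipped because the digits are not followed by a period.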