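I want to find section numbers with a regular expression. Given the following string, I want to pick out "3. Results" and "3.1. Result 2" without also matching the "5." at the end of the first line.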
import re
MyStr = """ test 5.
3. Results
3.1. Result 2
3.3.1 Result
test test test test"""
print(repr(MyStr))
match = re.findall(r"(?:\d[ \t]*?).+?\n\n", MyStr, re.DOTALL|re.MULTILINE)
print(match)
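However, I can't get the regex to distinguish between "test 5." and "3.". Question: how do I tell the regex that the period must not be followed by \n, for the first following character only? I tried adding [\t] in several ways, without success. The regex should still be flexible enough to pick up any form of "3.".
Thank you very much for your help.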
Stijn
Answer 0 (score: 2)
I'm not sure what the constraints of your numbering system are. In any case, the following code works for me on your example:
import re
MyStr = """ test 5.
3. Results
3.1. Result 2
3.3.1 Result
test test test test"""
str_list = re.findall(r'^(?:\d+\.)+.*?$', MyStr, re.MULTILINE)
for s in str_list:
    print(s)
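To see what the leading ^ contributes here, the following minimal sketch runs the same pattern with and without the anchor on the string from the question. The variable names (text, pattern, unanchored, anchored) are illustrative assumptions and not part of the answer.

```python
import re

# The sample string from the question, written with explicit newlines.
text = " test 5.\n3. Results\n3.1. Result 2\n3.3.1 Result\ntest test test test"

# Body of the pattern from the answer, without the leading ^.
pattern = r'(?:\d+\.)+.*?$'

# Without ^, a match may start anywhere in a line, so the "5." is picked up.
unanchored = re.findall(pattern, text, re.MULTILINE)

# With ^ and re.MULTILINE, a match has to begin at the start of a line.
anchored = re.findall('^' + pattern, text, re.MULTILINE)

print(unanchored)
print(anchored)
```

Since the first line begins with a space and its "5." sits at the end, only the anchored version leaves it out.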
Here is an improved version that handles all the cases in the modified table of contents.
import re
MyStr = """Table of Contents ...
1. 1st title
20. 1-line title
300. 2-lines title ...
... continued here
300.1. 1-line subtitle
300.2. 2-lines subtitle ...
... continued here
300.3.1 title, not followed by a blank line
300.3.20 next title omitted and no trailing period
300.3.31
300.3.45 next title omitted and trailing period
300.3.56.
4000. last title
999 Lorem ipsum dolor sit amet, consectetur adipisici elit,
sed eiusmod tempor incidunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse
cillum dolore eu fugiat nulla pariatur.
Excepteur sint obcaecat cupiditat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.
... followed by arbitrary text, which must not start with (a) digit(s) followed by a period"""
str_list = re.findall(r'''
    ^        # start of line
    (?:      # uncaptured ...
      \d+    #   1 or more decimal digits
      \.     #   period
    )+       # ... expression, repeated 1 or more times
    .*?      # minimal number of any characters (newlines too, because of re.DOTALL)
    $        # end of line
    ^        # start of line
    .*?      # minimal number of any characters
    $        # end of line
''', MyStr, re.MULTILINE | re.DOTALL | re.VERBOSE)
for s in str_list:
    print(s, end='')
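The pattern above ends an entry where a $ is directly followed by a ^, which in practice only happens at an empty line, so it needs blank lines between the entries in order to match. A minimal sketch of that behaviour on a hypothetical miniature table of contents (the toc string and the names entry and entries are assumptions made up for illustration):

```python
import re

# Hypothetical miniature table of contents; blank lines separate the entries.
toc = (
    "Contents\n"
    "\n"
    "1. First title\n"
    "\n"
    "2. Two-line title ...\n"
    "... continued here\n"
    "\n"
    "2.1\n"
    "\n"
    "9 plain text, no period after the leading number"
)

# Same pattern as in the answer, compiled once for reuse.
entry = re.compile(r'''
    ^            # start of line
    (?:\d+\.)+   # one or more "digits + period" groups
    .*?          # as little as possible (may span lines because of DOTALL)
    $            # end of a line ...
    ^            # ... directly followed by a line start: an empty line
    .*?          # minimal number of any characters (empty here)
    $            # end of (the empty) line
''', re.MULTILINE | re.DOTALL | re.VERBOSE)

entries = entry.findall(toc)
for e in entries:
    # Each match keeps its own trailing newline, hence end=''.
    print(e, end='')
```

Each match includes the newline that ends its last line, which is why the loop prints with end=''. The final paragraph is skipped because its leading number is not followed by a period.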