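A large document is made up of many small documents, separated by a header line of the form "1 of 1435 DOCUMENTS". I want to break it up into the 1435 small documents.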
re_1 = r"\d{1,4} of \d{1,4} DOCUMENTS.+?"
re_2 = r"\d{1,4} of \d{1,4} DOCUMENTS.+"
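re_1 only gives me "1 of 1435 DOCUMENTS" and so on. re_2 gives me the whole document.
Is it possible to use re.findall with a suitable regular expression? Or do I have to use re.split (which would be the simplest in this case), or loop over every line and check for the pattern? Thanks!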
1 of 1435 DOCUMENTS
blabla (multiple lines)
2 of 1435 DOCUMENTS
blabla(multiple lines)
3 of 1435 DOCUMENTS
blabla(multiple lines)
4 of 1435 DOCUMENTS
blabla(multiple lines)
5 of 1435 DOCUMENTS
....
Answer 0 (score: 1)
With Python versions prior to 3.7, you can use re.findall with
r'(?sm)^\d{1,4} of \d{1,4} DOCUMENTS.*?(?=^\d{1,4} of \d{1,4} DOCUMENTS|\Z)'
See the regex demo
Details
(?sm) - re.M and re.S options on
^ - start of a line
\d{1,4} of \d{1,4} DOCUMENTS - 1 to 4 digits, space, of, space, 1 to 4 digits, space and DOCUMENTS substring
.*? - any 0 or more chars, as few as possible, up to the closest
(?=^\d{1,4} of \d{1,4} DOCUMENTS|\Z) - ^\d{1,4} of \d{1,4} DOCUMENTS pattern or (|) the end of the string (\Z).
See the Python demo:
import re
s = "TEXT_HERE"
# The lookahead is anchored with ^ so that only a header at the start of a line ends a match
print(re.findall(r'^\d{1,4} of \d{1,4} DOCUMENTS.*?(?=^\d{1,4} of \d{1,4} DOCUMENTS|\Z)', s, re.M | re.S))
# => ['1 of 1435 DOCUMENTS\nblabla (multiple lines)\n\n', '2 of 1435 DOCUMENTS\nblabla(multiple lines)\n', '3 of 1435 DOCUMENTS\nblabla(multiple lines)\n', '4 of 1435 DOCUMENTS\nblabla(multiple lines)\n\n', '5 of 1435 DOCUMENTS\n....']
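To make the re.findall approach above easier to try out, here is a minimal self-contained sketch. The sample text and the names sample, HEADER and DOC_RE are made up for illustration; only the regex itself comes from the answer.

```python
import re

# Hypothetical sample input that mirrors the layout shown in the question.
sample = (
    "1 of 3 DOCUMENTS\n"
    "first body line\n"
    "second body line\n"
    "2 of 3 DOCUMENTS\n"
    "only body line\n"
    "3 of 3 DOCUMENTS\n"
    "last body\n"
)

# The header pattern is reused both as the start of a match and inside the lookahead.
HEADER = r"\d{1,4} of \d{1,4} DOCUMENTS"

# re.M lets ^ match at every line start; re.S lets . match newlines.
DOC_RE = re.compile(r"^" + HEADER + r".*?(?=^" + HEADER + r"|\Z)", re.M | re.S)

docs = DOC_RE.findall(sample)
for doc in docs:
    print(repr(doc))
```

Because the lookahead consumes no text, each match stops right before the next header, so joining the matches reproduces the input as long as the text starts with a header line.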
With Python 3.7 and later, where re.split can split on zero-length matches, you may use
r'(?m)(?!\A)(?=^\d{1,4} of \d{1,4} DOCUMENTS)'
See the regex demo.
Details
(?m) - re.M option is on
(?!\A) - not at the start of the string
(?=^\d{1,4} of \d{1,4} DOCUMENTS) - immediately to the right, there must be start of a line, 1 to 4 digits, space, of, space, 1 to 4 digits, space and DOCUMENTS substring
Usage:
re.split(r'(?!\A)(?=^\d{1,4} of \d{1,4} DOCUMENTS)', text, flags=re.M)
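The re.split call above can be tried on a small input as follows. This is only a sketch that assumes Python 3.7 or later; the sample text and the names sample, SPLIT_RE and parts are hypothetical.

```python
import re

# Hypothetical sample input that mirrors the layout shown in the question.
sample = (
    "1 of 3 DOCUMENTS\n"
    "first body line\n"
    "second body line\n"
    "2 of 3 DOCUMENTS\n"
    "only body line\n"
    "3 of 3 DOCUMENTS\n"
    "last body\n"
)

# Zero-length split point: any line start followed by a header, except the very
# start of the string (otherwise the first item would be an empty string).
SPLIT_RE = re.compile(r"(?!\A)(?=^\d{1,4} of \d{1,4} DOCUMENTS)", re.M)

# Requires Python 3.7+, where re.split accepts patterns that match the empty string.
parts = SPLIT_RE.split(sample)
for part in parts:
    print(repr(part))
```

Each element of parts keeps its own header line, since the split pattern matches a position rather than any characters.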