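How can I use a regular expression to split a large document into smaller documents based on a pattern?

Time: 2019-03-17 21:09:01

Tags: python regex

A large document is made up of smaller documents that are separated by a pattern such as "1 of 1435 DOCUMENTS". I want to split it into the 1435 smaller documents.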

 re_1 =  r"\d{1,4} of \d{1,4} DOCUMENTS.+?"

 re_2 =  r"\d{1,4} of \d{1,4} DOCUMENTS.+"

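re_1 only gives me "1 of 1435 DOCUMENTS" and so on. re_2 gives me the whole document.

Is it possible to use re.findall with a suitable regex? Or do I have to use re.split (is that the simplest option in this case), or loop over every line and check for the pattern? Thanks!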

1 of 1435 DOCUMENTS
blabla (multiple lines)

2 of 1435 DOCUMENTS
blabla(multiple lines)
3 of 1435 DOCUMENTS
blabla(multiple lines)
4 of 1435 DOCUMENTS
blabla(multiple lines)

5 of 1435 DOCUMENTS
....

1 answer:

Answer 0 (score: 1)

With Python versions prior to 3.7 (this also works in later versions), you can use re.findall with

r'(?sm)^\d{1,4} of \d{1,4} DOCUMENTS.*?(?=^\d{1,4} of \d{1,4} DOCUMENTS|\Z)'

Details

  • (?sm) - re.M and re.S options on
  • ^ - start of the line
  • \d{1,4} of \d{1,4} DOCUMENTS - 1 to 4 digits, space, of, space, 1 to 4 digits, space and DOCUMENTS substring
  • .*? - any 0 or more chars, as few as possible up to the closest
  • (?=^\d{1,4} of \d{1,4} DOCUMENTS|\Z) - ^\d{1,4} of \d{1,4} DOCUMENTS pattern or (|) the end of the string (\Z).

Python demo:

import re
s = "TEXT_HERE"  # placeholder: put the full document text here
# The lookahead is anchored with ^ so that it only stops at a header that starts a line
print(re.findall(r'^\d{1,4} of \d{1,4} DOCUMENTS.*?(?=^\d{1,4} of \d{1,4} DOCUMENTS|\Z)', s, re.M | re.S))
# With the sample text from the question:
# => ['1 of 1435 DOCUMENTS\nblabla (multiple lines)\n\n', '2 of 1435 DOCUMENTS\nblabla(multiple lines)\n', '3 of 1435 DOCUMENTS\nblabla(multiple lines)\n', '4 of 1435 DOCUMENTS\nblabla(multiple lines)\n\n', '5 of 1435 DOCUMENTS\n....']
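To see why the two patterns from the question behave the way they do, here is a small sketch. It assumes they were run with re.S (the question does not say which flags were used), and sample and header are made-up names for a shortened stand-in of the real document.

```python
import re

# Hypothetical, shortened stand-in for the real document.
sample = (
    "1 of 3 DOCUMENTS\nfirst body\n\n"
    "2 of 3 DOCUMENTS\nsecond body\n"
    "3 of 3 DOCUMENTS\nthird body\n"
)

header = r"\d{1,4} of \d{1,4} DOCUMENTS"

# Lazy tail with nothing after it: .+? is satisfied by a single character,
# so each match is just the header plus the newline that follows it.
lazy = re.findall(header + r".+?", sample, re.S)

# Greedy tail: with re.S, .+ runs to the end of the string,
# so the first header swallows everything after it.
greedy = re.findall(header + r".+", sample, re.S)

# Lazy tail bounded by a lookahead: expansion stops right before the next
# header that starts a line, or at the end of the string.
bounded = re.findall(
    r"^" + header + r".*?(?=^" + header + r"|\Z)", sample, re.M | re.S
)

print(lazy)
print(greedy)
print(bounded)
```

The lazy quantifier only helps once something after it tells the engine where to stop; the lookahead supplies that stopping point without consuming the next header.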

With Python 3.7 and later, where re.split can split on zero-length (empty) matches, you may use

r'(?m)(?!\A)(?=^\d{1,4} of \d{1,4} DOCUMENTS)'

Details

  • (?m) - re.M option is on
  • (?!\A) - not at the start of the string
  • (?=^\d{1,4} of \d{1,4} DOCUMENTS) - immediately to the right, there must be start of a line, 1 to 4 digits, space, of, space, 1 to 4 digits, space and DOCUMENTS substring

Usage:

re.split(r'(?!\A)(?=^\d{1,4} of \d{1,4} DOCUMENTS)', text, flags=re.M)
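As a runnable sketch of the re.split approach, the snippet below wraps the same pattern in a helper. It requires Python 3.7 or later; split_documents, HEADER_SPLIT and the sample text are hypothetical names and data chosen for illustration, not part of the original answer.

```python
import re

# Zero-width split point: any position except the very start of the string
# that is immediately followed by a header at the start of a line.
HEADER_SPLIT = re.compile(r"(?!\A)(?=^\d{1,4} of \d{1,4} DOCUMENTS)", re.M)


def split_documents(text):
    """Return the list of small documents, each starting with its header.

    Relies on re.split accepting empty matches (Python 3.7+).
    """
    return HEADER_SPLIT.split(text)


# Hypothetical, shortened stand-in for the real document.
sample = (
    "1 of 3 DOCUMENTS\nfirst body\n\n"
    "2 of 3 DOCUMENTS\nsecond body\n"
    "3 of 3 DOCUMENTS\nthird body\n"
)

docs = split_documents(sample)
for doc in docs:
    # Show only the header line of each piece.
    print(doc.splitlines()[0])
```

Because the split points are zero-width, no text is consumed: joining the pieces back together reproduces the original string, and the (?!\A) guard avoids an empty first element.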