Suppose I have a multi-line plain-text file that contains the following ordered list.
This is a text\n
that contains an ordered/numbered list\n
appearing on multiple lines in a plain-text file.\n
\n
Item 1. This is a list where each item can span over\n
multiple lines\n
Item 2. that I want to extract each separate item from but ONLY in series (order)\n
Item 3. non-blank text\n
Item 4. non-blank text\n
Item 5. non-blank text\n
Item 6. non-blank text\n
Item 7. non-blank text\n
Item 8. non-blank text\n
Item 9. non-blank text\n
Item 10. non-blank text\n
Item 11. The items are in an ordered list, but digits may repeat (11, 22)\n
or they may be preceded or followed by another digit (20, 35, 300) with\n
...
Item 999. Up to 999 items\n
in each ordered list\n
\n
But, (most annoyingly), any Item n (with up to 3 digits) or Items may be repeated\n
or back-referenced later in text but not\n
again as an ordered list (or in series) as the first\n
instance of each item in the list above.
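Desired capture/output from the regex:
Return the text of each item as it appears in the ordered list (an item may span multiple lines).
Item 1. [text]\n
Item 2. [text]\n
[text may span multiple lines]
Item N (up to 999). [text]\n
My best regex construct so far is: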
(Item\s[\d]+\. )(.*?)(?=(Item\s[\d]+\.)|($))
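The pattern above does not capture the newlines or the continuation lines that belong to each "Item" in the ordered list above.
My question: is it possible to use a regex in Python to extract just the items of the ordered list? If a regex cannot do it, how can I most efficiently "locate" the ordered list in text like this and extract it with Python?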
Answer 0: (score: 0)
Compile the Python regex with the DOTALL flag, so that . also matches newline characters.
import re
pattern = re.compile(r'(Item\s[\d]+\. )(.*?)(?=(Item\s[\d]+\.)|($))', re.DOTALL)
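DOTALL lets an item span several lines, but on its own it does not skip a later back-reference such as a second "Item 2." further down the text. The following is a minimal sketch of one way to keep only the in-series items, not a definitive solution. It rests on assumptions that the question does not confirm: every list entry starts at the beginning of a line, the numbering starts at 1 and increases by exactly 1, and the last item ends at the first blank line (or at the end of the text). The names ITEM_START and extract_ordered_items are hypothetical.

```python
import re

# Assumption: a list entry starts at the beginning of a line as "Item <n>. "
# with n having at most 3 digits.
ITEM_START = re.compile(r"^Item\s(\d{1,3})\.\s", re.MULTILINE)


def extract_ordered_items(text):
    """Return [(number, body), ...] for the run Item 1, Item 2, ... in text."""
    starts = []   # (number, marker start, body start) for each accepted marker
    expected = 1  # assumption: the list starts at 1 and increases by 1
    for m in ITEM_START.finditer(text):
        # Accept a marker only if it is the next number in the series;
        # repeated or out-of-order markers (back-references) are skipped.
        if int(m.group(1)) == expected:
            starts.append((expected, m.start(), m.end()))
            expected += 1

    items = []
    for i, (number, _, body_start) in enumerate(starts):
        if i + 1 < len(starts):
            # The body runs up to the next accepted marker.
            body_end = starts[i + 1][1]
        else:
            # Assumption: the last item ends at the first blank line,
            # or at the end of the text if there is none.
            blank = re.search(r"\n\s*\n", text[body_start:])
            body_end = body_start + blank.start() if blank else len(text)
        items.append((number, text[body_start:body_end].strip()))
    return items


sample = (
    "Intro text\n\n"
    "Item 1. first\ncontinues\n"
    "Item 2. second\n"
    "Item 3. third\n\n"
    "Later, see\nItem 2. again, and Item 1. too.\n"
)
for number, body in extract_ordered_items(sample):
    print(number, repr(body))
```

This sketch can still be fooled: a back-reference that starts a line and happens to carry the next expected number (for example "Item 4." after a three-item list) would be accepted as a list entry.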