Question

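I have a txt file that I converted from a PDF containing a very long list of items. The items are numbered according to the following convention: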

[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}

This expression matches, for example:

A1.1.1

and

ZZ99.99.99
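A quick way to sanity-check the numbering pattern is to run it against a few candidates with `re.fullmatch`. This is only a minimal sketch: the helper name `is_item_number` and the two negative examples are made up for illustration and do not come from the original file.

```python
import re

# The numbering convention from the question: 1-2 capital letters,
# then three dot-separated groups of 1-2 digits.
ITEM_NUMBER = re.compile(r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}")


def is_item_number(text):
    """Return True if the whole string is a valid item number (hypothetical helper)."""
    return ITEM_NUMBER.fullmatch(text) is not None


print(is_item_number("A1.1.1"))      # True
print(is_item_number("ZZ99.99.99"))  # True
print(is_item_number("ABC1.1.1"))    # False: three letters
print(is_item_number("A100.1.1"))    # False: three digits in the first group
```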

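This works fine. The problem is that I'm trying to capture the item number in group 1 and everything between one item number and the next (the item's description) in group 2.

I also need the results returned as a list or other iterable, so that I can eventually export the captured content to an Excel spreadsheet.

This is my current regex: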

^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s)([\w\W]*?)(?:\n)
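To make the symptom concrete, here is a minimal reproduction. The sample text is invented for illustration, and the multiline flag is an assumption (it is what makes `^` match at every line start). The lazy `[\w\W]*?` stops as soon as the following `(?:\n)` can match, so group 2 ends at the first newline and later paragraphs of the same item are dropped.

```python
import re

# Hypothetical sample: one item whose description spans two paragraphs.
sample = (
    "A1.1.1 First paragraph of the description.\n"
    "Second paragraph of the same description.\n"
    "A1.1.2 Next item.\n"
)

# The regex from the question, compiled with re.MULTILINE (assumed).
current = re.compile(
    r"^([A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}\s)([\w\W]*?)(?:\n)",
    re.MULTILINE,
)

matches = current.findall(sample)
for number, description in matches:
    # The second paragraph of A1.1.1 never shows up in any match.
    print(repr(number), repr(description))
```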

Follow this link for a sample of what I have and the problem I'm facing:

Debuggex Demo

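Can anyone help me figure out how to capture everything between each item number, no matter how many paragraphs there are?

Any input is greatly appreciated, thanks!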

Answer 1

You're very close:

import re

s = """
A1.2.1 This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.ZZ99.99.99
"""
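# Raw string avoids invalid-escape warnings; re.DOTALL lets "." match newlines,
# so a description can span several lines.
final_data = re.findall(r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}(.*?)[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}", s, flags=re.DOTALL)
print(final_data)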

Output:

[' This is the first paragraph of the description that is being captured by the regex even if the description contains multiple lines of text.']

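The lazy group (.*?) matches any text between the letters and digits defined by the first regular expression, that is, between two item numbers. By default . does not match newlines, so pass re.DOTALL when a description spans several lines.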
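One caveat with the pattern above: it consumes the closing item number, so with back-to-back items `re.findall` would skip every second one, and it returns only the description. A possible variant, sketched under a few assumptions, uses a lookahead so the next item number is left in place, and captures the number in group 1 and the description in group 2. The assumptions: each item number starts a line, the names `parse_items` and `write_items` and the sample text are hypothetical, and CSV (which Excel can open) stands in for a real spreadsheet export.

```python
import csv
import re

ITEM = r"[A-Z]{1,2}\d{1,2}\.\d{1,2}\.\d{1,2}"

# Group 1: an item number at the start of a line.
# Group 2: everything up to, but not including, the next item number
#          at the start of a line, or the end of the text.
ITEM_WITH_DESCRIPTION = re.compile(
    rf"^({ITEM})\s+(.*?)(?=^{ITEM}\s|\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_items(text):
    """Return a list of (item_number, description) tuples."""
    return [
        (m.group(1), m.group(2).strip())
        for m in ITEM_WITH_DESCRIPTION.finditer(text)
    ]


def write_items(items, out):
    """Write the items as CSV rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["Item", "Description"])
    writer.writerows(items)


# Invented sample text, for illustration only.
sample = """A1.1.1 First paragraph.

Second paragraph of the same item.
A1.1.2 Another item.
ZZ99.99.99 Last item.
"""

items = parse_items(sample)
for number, description in items:
    print(number, repr(description))
```

To produce a file, something like `with open("items.csv", "w", newline="", encoding="utf-8") as f: write_items(items, f)` should work; the file name is arbitrary. Because the lookahead requires the next number to start a line, an item number quoted in the middle of a description does not split that description.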
