Multiline regex - problem capturing repeated groups

Date: 2017-06-26 21:57:07

Tags: regex pcre multiline

I am trying to parse the following text:

This is some header 1

nonsense text 1


Repeated item 1
Repeated item 1 Data

nonsense text 1


Repeated item 2
Repeated item 2 Data

This is some header 2

nonsense text 1

Repeated item 1
Repeated item 1 Data

nonsense text 1

Repeated item 2
Repeated item 2 Data

I am trying to capture the repeated items together with the number from the header that precedes them, like this:

This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

I have no problem capturing the repeated items with this:

Repeated Item ([0-9]+)\sSome item data: (.*)

However, for each repeated item I also want to capture the header that precedes it (this regex does not work):

This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)

I also tried the following regex, a variant of the one directly above:

(?sm)This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)

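However, the regex above captures only the first header and the last repeated item. Is there a way to achieve what I want with a regex? I could obviously parse the text by hand, line by line, but I was hoping to do this with a regex.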

1 answer:

Answer 0: (score: 0)

Updated for your example:

/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/m


Perl example:

$ txt='This is some header 1
> 
> nonsense text 1
> 
> 
> Repeated item 1
> Repeated item 1 Data
> 
> nonsense text 1
> 
> 
> Repeated item 2
> Repeated item 2 Data
> 
> This is some header 2
> 
> nonsense text 1
> 
> Repeated item 1
> Repeated item 1 Data
> 
> nonsense text 1
> 
> Repeated item 2
> Repeated item 2 Data'

$ echo "$txt" | perl -0777 -lne 'while (/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/gm) {print "$1\n$2\n$3\n$4\n$5\n\n" }'
This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data

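A more robust approach is to first split the text into blocks, one per header, and then extract the repeated items from each block separately.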
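
The two-step idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: it uses Python's `re` module instead of Perl/PCRE, the names `HEADER`, `ITEM` and `parse` are made up for the example, and it assumes (as in the sample text) that each data line directly follows its "Repeated item N" line.

```python
import re

text = """This is some header 1

nonsense text 1


Repeated item 1
Repeated item 1 Data

nonsense text 1


Repeated item 2
Repeated item 2 Data

This is some header 2

nonsense text 1

Repeated item 1
Repeated item 1 Data

nonsense text 1

Repeated item 2
Repeated item 2 Data
"""

# Step 1 pattern: a header line; its positions mark where each block starts.
HEADER = re.compile(r"^This is some header (\d+)[ \t]*$", re.MULTILINE)

# Step 2 pattern: an item line immediately followed by its data line
# (assumption taken from the sample text).
ITEM = re.compile(
    r"^Repeated item (\d+)[ \t]*\n(Repeated item \d+ Data.*)$",
    re.MULTILINE,
)


def parse(text):
    """Return a list of (header_number, [(item_number, data_line), ...])."""
    headers = list(HEADER.finditer(text))
    result = []
    for i, h in enumerate(headers):
        # A block runs from the end of this header to the start of the next one.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        block = text[h.end():end]
        items = [(int(m.group(1)), m.group(2)) for m in ITEM.finditer(block)]
        result.append((int(h.group(1)), items))
    return result


for header_no, items in parse(text):
    print(f"This is some header {header_no}")
    for item_no, data in items:
        print(f"Repeated item {item_no}")
        print(data)
    print()
```

Because each block is searched on its own, the number of repeated items per header no longer has to be hard-coded into the pattern, which is the main limitation of the single-regex version above.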