I am trying to parse the following text:
This is some header 1
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data
This is some header 2
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data
I am trying to capture the repeated items together with the number from the preceding header, like this:
This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
I have no trouble capturing the repeated items with this:
Repeated Item ([0-9]+)\sSome item data: (.*)
However, for each repeated item I also want to capture the header that precedes it (but this regex does not work):
This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)
I also tried the following regex, a variant of the one directly above:
(?sm)This is some header ([0-9]+).*Repeated Item ([0-9]+)\sSome item data: (.*)
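However, the regex above captures only the first header and the last repeated item. Is there a way to do what I want with a regex? I could obviously parse the text manually line by line, but I was hoping to achieve this with a regular expression.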
Answer 0 (score: 0)
Updated for your example:
/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/m
Perl example:
$ txt='This is some header 1
>
> nonsense text 1
>
>
> Repeated item 1
> Repeated item 1 Data
>
> nonsense text 1
>
>
> Repeated item 2
> Repeated item 2 Data
>
> This is some header 2
>
> nonsense text 1
>
> Repeated item 1
> Repeated item 1 Data
>
> nonsense text 1
>
> Repeated item 2
> Repeated item 2 Data'
$ echo "$txt" | perl -0777 -lne 'while (/^(This is some header \d+)[\s\S]+?^(Repeated item \d+)\s*^(Repeated item.*)[\s\S]+?(Repeated item \d+)\s*^(Repeated item.*)/gm) {print "$1\n$2\n$3\n$4\n$5\n\n" }'
This is some header 1
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
This is some header 2
Repeated item 1
Repeated item 1 Data
Repeated item 2
Repeated item 2 Data
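A more robust approach is to first split the text into blocks, one per header, and then pull the repeated items out of each block separately.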
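The two-step idea can be sketched as follows. This is only an illustration in Python rather than Perl, and it assumes the layout of the sample text: every block starts with a line beginning `This is some header N`, and each `Repeated item N` line is immediately followed by its data line. The names `HEADER`, `ITEM`, and `parse` are made up for this example.

```python
import re

text = """\
This is some header 1
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data
This is some header 2
nonsense text 1
Repeated item 1
Repeated item 1 Data
nonsense text 1
Repeated item 2
Repeated item 2 Data
"""

# Step 1: header lines mark where each block begins.
HEADER = re.compile(r"^This is some header (\d+).*$", re.MULTILINE)
# Step 2: inside one block, an item line followed directly by its data line
# (assumption: the data is always the very next line).
ITEM = re.compile(r"^Repeated item (\d+)[ \t]*\n(.*)$", re.MULTILINE)


def parse(text):
    """Return (header_number, item_number, data_line) tuples."""
    results = []
    headers = list(HEADER.finditer(text))
    for i, header in enumerate(headers):
        # A block runs from the end of this header to the start of the next one.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        block = text[header.end():end]
        for item in ITEM.finditer(block):
            results.append((int(header.group(1)), int(item.group(1)), item.group(2)))
    return results


rows = parse(text)
for row in rows:
    print(row)
```

Because each item is matched only within its own block, the number of repeated items per header no longer has to be hard-coded into the pattern, which is the main limitation of the single-regex version above.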