Question

I want to match the inner text of span tags, but only for spans whose id is guaranteed to start with blk.

How can I match this with Groovy?

Example:

<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>

From the example above, I want:

   match
   between
   starts

I tried the following, but it returns null:

 def html='''<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>''' 

 html=html.findAll(/<span id="blk(.)*">(.)*<\/span>/).join();
 println html;

Answer 1

Why not parse the HTML and extract the nodes from it, rather than messing around with regular expressions?

@Grab( 'net.sourceforge.nekohtml:nekohtml:1.9.18' )
import org.cyberneko.html.parsers.SAXParser

def html = '''<p>
             |  I wanted to try to <span id="blk1">match</span> the inner part
             |  of the string<span id="blk2"> between </span> the span tags <span>where</span>
             |  it is guaranteed that the id of this span tags <span id="blk3">starts</span>
             |  with blk.
             |</p>'''.stripMargin()

def content = new XmlSlurper( new SAXParser() ).parseText( html )

List<String> spans = content.'**'.findAll { it.name() == 'SPAN' && it.@id?.text()?.startsWith( 'blk' ) }*.text()

Answer 2

You seem to have span on one side and strong on the other.

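Also be careful about using .* on its own: regular expressions are greedy by default, so it will swallow most of the string in one go. You should generally use .*? to make it lazy.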
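To illustrate the difference, here is a minimal sketch. The sample string is a hypothetical, shortened version of the HTML in the question, and the variable names are only for illustration.

```groovy
// Hypothetical sample input, shortened from the question's HTML
def sample = '<span id="blk1">match</span> and <span id="blk2">between</span>'

// Greedy: .* runs on to the last </span>, so there is one oversized match
def greedy = sample.findAll(/<span id="blk.*">.*<\/span>/)

// Lazy: .*? stops as soon as the rest of the pattern can match
def lazy = sample.findAll(/<span id="blk.*?">.*?<\/span>/)

assert greedy.size() == 1
assert lazy.size() == 2

println greedy
println lazy
```

With the greedy pattern, the single match spans both span elements and the text between them.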

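When you use (.)* to match the text between the tags, the group does not give you the actual text, only the last character that was matched. You need to put the quantifier inside the capture group.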
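A small sketch of that behaviour, assuming a single hypothetical span as input:

```groovy
// Hypothetical one-span input
def sample = '<span id="blk1">match</span>'

// (.)* repeats a one-character group, so the group keeps only the last character
def repeated = (sample =~ /<span id="blk1">(.)*<\/span>/)
assert repeated.find()
assert repeated.group(1) == 'h'

// (.*?) puts the quantifier inside the group, so the group holds the whole text
def inner = (sample =~ /<span id="blk1">(.*?)<\/span>/)
assert inner.find()
assert inner.group(1) == 'match'
```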

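Using [^<>]+ is a better way to match the text between HTML tags. It is similar to .* except for a couple of points:

It matches any character except "<" and ">".
It requires at least one character, so it does not match an empty string.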
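As a rough sketch of both points, assuming a hypothetical input in which the first span is empty:

```groovy
// Hypothetical input: the first span has no text, the second one does
def sample = '<span id="blk1"></span><span id="blk2">text</span>'

// [^<>]+ needs at least one character that is not < or >,
// so the empty span is skipped and the match cannot run across tags
def texts = sample.findAll(/<span id="blk[^"]*">([^<>]+)<\/span>/) { full, inner -> inner }

assert texts == ['text']
println texts
```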

Also, if you can be sure that whatever follows "blk" will always be an integer, I suggest using \d+ to match it.

html=html.findAll(/<span id="blk\d+">([^<>]+)<\/span>/).join();

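That said, I don't have much experience with Groovy, but I take it you want to print a list containing just those three words? The following regex will also extract only the text from the HTML.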

html=html.findAll(/(?<=span id="blk\d">)([^<>]+)(?=<\/span>)/).join();
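If a list of the three words is what you are after, one possible alternative to the lookarounds is to pass a closure to findAll and return only the captured group. This is a sketch rather than a tested solution for every input; it assumes the id is always blk followed by digits, and the trim() call is my own addition to drop the spaces around " between ".

```groovy
// The HTML from the question
def html = '''<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>'''

// The closure receives the full match and then each capture group;
// returning the trimmed group gives a list of the inner texts only
def words = html.findAll(/<span id="blk\d+">([^<>]+)<\/span>/) { full, inner -> inner.trim() }

assert words == ['match', 'between', 'starts']
words.each { println it }
```

The span without an id is not matched, because the pattern requires id="blk..." directly after the tag name.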
