Question

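I need to parse a malformed HTML page and extract certain URLs from it as any kind of Collection. I don't really care what kind of Collection it is; I just need to be able to iterate over it.

Suppose we have a structure like this: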

<html>
  <body>
    <div class="outer">
      <div class="inner">
        <a href="http://www.google.com" title="Google">Google-Link</a>
        <a href="http://www.useless.com" title="I don't need this">Blah blah</a>
      </div>
      <div class="inner">
        <a href="http://www.youtube.com" title="Youtube">Youtube-Link</a>
        <a href="http://www.useless2.com" title="I don't need this2">Blah blah2</a>
      </div>
    </div>
  </body>
</html>

Here is what I have done so far:

// tagsoup version 1.2 is under apache license 2.0
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2' )
XmlSlurper slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser());

GPathResult nodes = slurper.parse("test.html"); 
def links = nodes."**".findAll { it.@class == "inner" }
println links

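What I want: ["http://google.com", "http://youtube.com"], but what I get is: ["Google-LinkBlah blah", "Youtube-LinkBlah blah2"]

To be more precise, I can't use all of the URLs, because the HTML document I need to parse is about 15,000 lines long and contains a lot of URLs I don't need. So I need the first URL in each "inner" block.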

Answer 1

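As Trav says, you need to grab the href attribute from each matching a tag.

You have edited the question, so the class bit in the findAll no longer makes sense, but with the current HTML example this should work: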

def links = nodes.'**'.findAll { it.name() == 'a' }*.@href*.text()

Edit

If (as you say after the edit) you only want the first a inside anything marked with class="inner", then try:

def links = nodes.'**'.findAll { it.@class?.text() == 'inner' }
                 .collect { d -> d.'**'.find { it.name() == 'a' }?.@href }
                 .findAll() // remove nulls if there are any
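
For anyone who wants to try this without a file on disk, here is a minimal self-contained sketch of the same idea. It makes a few assumptions that are not in the original post: the sample markup is inlined as a string, it is parsed with a plain XmlSlurper (the sample happens to be well-formed, so TagSoup is not needed here), and the import targets Groovy 3 or later. For a genuinely malformed page you would pass the TagSoup parser to the XmlSlurper constructor as in the question. It also calls text() on the attribute so the result is a list of plain strings.

```groovy
// Assumption: Groovy 3+. On Groovy 2.x drop this import,
// since groovy.util.XmlSlurper is imported by default there.
import groovy.xml.XmlSlurper

// Sample markup from the question, inlined so the sketch runs on its own.
def html = '''<html>
  <body>
    <div class="outer">
      <div class="inner">
        <a href="http://www.google.com" title="Google">Google-Link</a>
        <a href="http://www.useless.com" title="I don't need this">Blah blah</a>
      </div>
      <div class="inner">
        <a href="http://www.youtube.com" title="Youtube">Youtube-Link</a>
        <a href="http://www.useless2.com" title="I don't need this2">Blah blah2</a>
      </div>
    </div>
  </body>
</html>'''

def nodes = new XmlSlurper().parseText(html)

def links = nodes.'**'
        // every element whose class attribute is exactly "inner"
        .findAll { it.@class.text() == 'inner' }
        // first <a> below each of them, reduced to its href as a String
        .collect { div -> div.'**'.find { it.name() == 'a' }?.@href?.text() }
        // drop nulls and empty strings (blocks without a usable link)
        .findAll()

assert links == ['http://www.google.com', 'http://www.youtube.com']
links.each { println it }
```

The result is an ordinary List of Strings, so it can be iterated with each or a for loop.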

Answer 2

You're looking for @href on each node.
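
To illustrate why the code in the question prints link text rather than URLs, here is a small sketch. It assumes Groovy 3+ for the import and uses a single inlined "inner" block parsed with a plain XmlSlurper instead of TagSoup. Printing a node shows the text of everything below it, while @href reads the attribute.

```groovy
// Assumption: Groovy 3+. On Groovy 2.x drop this import.
import groovy.xml.XmlSlurper

def div = new XmlSlurper().parseText('''<div class="inner">
  <a href="http://www.google.com" title="Google">Google-Link</a>
  <a href="http://www.useless.com" title="I don't need this">Blah blah</a>
</div>''')

// text() on an element joins the text of all its descendants,
// which is what println shows for a matched div.
def asText = div.text()
assert asText == 'Google-LinkBlah blah'

// Reading @href on each child <a> gives the URLs instead.
def hrefs = div.a.collect { it.@href.text() }
assert hrefs == ['http://www.google.com', 'http://www.useless.com']
```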
