I am scraping some data whose hierarchy is /h2/a, where the a element's href should contain http://www.thedomain.com. All the links look something like thedomain.com/test and so on. Right now I get only the link text, not the href value itself.
For example:
<h2>
<a href="http://www.thedomain.com/test">Hey there</a>
<a href="http://www.thedomain.com/test1">2nd link</a>
<a href="http://www.thedomain.com/test2">3rd link</a>
</h2>
Here is my code:
html_doc.xpath('//h2/a[contains(@href, "http://www.thedomain.com")]/text()')
This returns Hey there, 2nd link, 3rd link, whereas I want http://www.thedomain.com/test and so on.
Answer 0 (score: 1)
Just select @href instead of text():
//h2/a[contains(@href, "http://www.thedomain.com")]/@href
Answer 1 (score: 1)
You can also use CSS selectors for this (which may be easier to use than xpath in this case). The following selects the <a> elements under h2:
html_doc.css('h2 a')
Here is a complete working version of the code:
require 'nokogiri'

html = <<EOT
<html>
<h2>
<a href="http://www.thedomain.com/test">Hey there</a>
<a href="http://www.thedomain.com/test1">2nd link</a>
<a href="http://www.thedomain.com/test2">3rd link</a>
</h2>
</html>
EOT
html_doc = Nokogiri::HTML(html)
# Print each link's href; each is used because the return value is discarded.
html_doc.css('h2 a').each { |link| p link['href'] }
# => "http://www.thedomain.com/test"
# => "http://www.thedomain.com/test1"
# => "http://www.thedomain.com/test2"