Question

I want to extract specific links from a web page, using Nokogiri to search by link text:

<div class="links">
   <a href='http://example.org/site/1/'>site 1</a>
   <a href='http://example.org/site/2/'>site 2</a>
   <a href='http://example.org/site/3/'>site 3</a>
</div>

I want the href for "site 3", which should return:

http://example.org/site/3/

Or I want the href for "site 1", which should return:

http://example.org/site/1/

How can I do this?

Answer 1

Original:

require 'nokogiri'

text = <<TEXT
<div class="links">
  <a href='http://example.org/site/1/'>site 1</a>
  <a href='http://example.org/site/2/'>site 2</a>
  <a href='http://example.org/site/3/'>site 3</a>
</div>
TEXT

link_text = "site 1"

doc = Nokogiri::HTML(text)
p doc.xpath("//a[text()='#{link_text}']/@href").to_s

Update:

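As far as I know, Nokogiri's XPath implementation does not support regular expressions. For a basic prefix match, there is a function named starts-with that can be used like this (links whose text starts with "s"):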

doc = Nokogiri::HTML(text)
array_of_hrefs = doc.xpath("//a[starts-with(text(), 's')]/@href").map(&:to_s)

Answer 2

Maybe you would prefer CSS-style selection:

doc.at('a[text()="site 1"]')[:href] # exact match
doc.at('a[text()^="site 1"]')[:href] # starts with
doc.at('a[text()*="site 1"]')[:href] # match anywhere

Answer 3

require 'nokogiri'

text = "site 1"

doc = Nokogiri::HTML(DATA)
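p doc.xpath("//div[@class='links']//a[contains(text(), '#{text}')]/@href").to_s

__END__
<div class="links">
   <a href='http://example.org/site/1/'>site 1</a>
   <a href='http://example.org/site/2/'>site 2</a>
   <a href='http://example.org/site/3/'>site 3</a>
</div>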

Answer 4

Just for the record, here is another way to do this in Ruby, using the URI module:

require 'uri'

html = %q[
<div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
</div>
]

uris = Hash[URI.extract(html).map.with_index{ |u, i| [1 + i, u] }]

=> {
    1 => "http://example.org/site/1/'",
    2 => "http://example.org/site/2/'",
    3 => "http://example.org/site/3/'"
}

uris[1]
=> "http://example.org/site/1/'"

uris[3]
=> "http://example.org/site/3/'"

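Under the hood, URI.extract uses a regular expression. That is not the most robust way to find links in a page, but it works reasonably well, because a URI that is meant to be useful is normally a string without whitespace. Note that the results above keep the trailing apostrophe from the href attribute, so they need cleaning before use.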
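
To show what a regex-based approach looks like when the goal is to look up an href by its link text, here is a rough sketch using only core Ruby. The ANCHOR pattern is a deliberately simple assumption written for this snippet; it is not the expression URI.extract uses, and it will miss unquoted attributes and nested markup.

```ruby
html = %q[
<div class="links">
    <a href='http://example.org/site/1/'>site 1</a>
    <a href='http://example.org/site/2/'>site 2</a>
    <a href='http://example.org/site/3/'>site 3</a>
</div>
]

# Illustrative pattern: captures the quote character, the href value
# up to the matching quote, and the text between <a ...> and </a>.
ANCHOR = /<a\s[^>]*href=(['"])(.*?)\1[^>]*>(.*?)<\/a>/m

# Build a hash that maps link text to href.
links = html.scan(ANCHOR).to_h { |_quote, href, label| [label.strip, href] }

p links['site 3']
```

Matching up to the closing quote also avoids the stray apostrophe seen in the URI.extract output.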

Extracting a link's href with Nokogiri by its link text?
