I have the following HTML, which contains several duplicate hrefs. How can I extract only the unique links?
<div class="pages">
<a href="/search_results.aspx?f=Technology&Page=1" class="active">1</a>
<a href="/search_results.aspx?f=Technology&Page=2">2</a>
<a href="/search_results.aspx?f=Technology&Page=3">3</a>
<a href="/search_results.aspx?f=Technology&Page=4">4</a>
<a href="/search_results.aspx?f=Technology&Page=5">5</a>
<a href="/search_results.aspx?f=Technology&Page=2">next ›</a>
<a href="/search_results.aspx?f=Technology&Page=6">last »</a>
</div>
# p is the page that contains this HTML
# The line below returns 7, as expected, but I don't need the next/last links because they are duplicates
p.css(".pages a").count
# So I tried uniq, which didn't work
p.css(".pages").css("a").uniq # didn't work
p.css(".pages").css("a").to_a.uniq # didn't work
Answer 0 (score: 4)
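Try extracting the href attribute from each matched element (el.attr('href')) and calling uniq on the resulting strings. Calling uniq on the elements themselves removes nothing, because each a element is a distinct node even when its href matches another's: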
require 'nokogiri'

# your_html_string holds the markup shown in the question
html = Nokogiri::HTML(your_html_string)
# Map every link to its href string, then drop the repeated strings
html.css('a').map { |el| el.attr('href') }.uniq
# => ["/search_results.aspx?f=Technology&Page=1",
# "/search_results.aspx?f=Technology&Page=2",
# "/search_results.aspx?f=Technology&Page=3",
# "/search_results.aspx?f=Technology&Page=4",
# "/search_results.aspx?f=Technology&Page=5",
# "/search_results.aspx?f=Technology&Page=6"]
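If the elements themselves are needed rather than just the href strings (for instance, to keep the link text), Array#uniq also accepts a block that supplies the comparison key. The sketch below is plain Ruby with no Nokogiri dependency: FakeLink is a hypothetical stand-in for a Nokogiri node, assumed only to respond to [] the way a node does for attributes. It also illustrates why a bare uniq removed nothing in the question: two distinct objects are not equal by default, even when their href values match.

```ruby
# FakeLink is a hypothetical stand-in for a Nokogiri node; it is not part of
# Nokogiri. Like a node, it answers link['href'] with the attribute value.
class FakeLink
  attr_reader :text

  def initialize(href, text)
    @attrs = { 'href' => href }
    @text = text
  end

  def [](name)
    @attrs[name]
  end
end

base = '/search_results.aspx?f=Technology&Page='
links = [
  FakeLink.new("#{base}1", '1'),
  FakeLink.new("#{base}2", '2'),
  FakeLink.new("#{base}2", 'next'),
  FakeLink.new("#{base}6", 'last')
]

# Plain uniq compares the objects themselves, so nothing is dropped.
by_identity = links.uniq

# uniq with a block compares the block's return value (the href) and keeps
# the first element seen for each distinct key.
by_href = links.uniq { |link| link['href'] }

puts by_identity.size            # 4
puts by_href.map(&:text).inspect # ["1", "2", "last"]
```

With a real document, the same block form would presumably be applied as p.css('.pages a').to_a.uniq { |el| el['href'] }.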
Answer 1 (score: 3)
The same can be done with #xpath. I would do it like this:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-HTML
<div class="pages">
<a href="/search_results.aspx?f=Technology&Page=1" class="active">1</a>
<a href="/search_results.aspx?f=Technology&Page=2">2</a>
<a href="/search_results.aspx?f=Technology&Page=3">3</a>
<a href="/search_results.aspx?f=Technology&Page=4">4</a>
<a href="/search_results.aspx?f=Technology&Page=5">5</a>
<a href="/search_results.aspx?f=Technology&Page=2">next ›</a>
<a href="/search_results.aspx?f=Technology&Page=6">last »</a>
</div>
HTML
doc.xpath("//a/@href").map(&:to_s).uniq
# => ["/search_results.aspx?f=Technology&Page=1",
# "/search_results.aspx?f=Technology&Page=2",
# "/search_results.aspx?f=Technology&Page=3",
# "/search_results.aspx?f=Technology&Page=4",
# "/search_results.aspx?f=Technology&Page=5",
# "/search_results.aspx?f=Technology&Page=6"]
Answer 2 (score: 0)
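Another way to do the same job is to handle the unique-value selection in the XPath expression itself. Note that preceding-sibling only compares links that share the same parent, as they do in this markup: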
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-HTML
<div class="pages">
<a href="/search_results.aspx?f=Technology&Page=1" class="active">1</a>
<a href="/search_results.aspx?f=Technology&Page=2">2</a>
<a href="/search_results.aspx?f=Technology&Page=3">3</a>
<a href="/search_results.aspx?f=Technology&Page=4">4</a>
<a href="/search_results.aspx?f=Technology&Page=5">5</a>
<a href="/search_results.aspx?f=Technology&Page=2">next ›</a>
<a href="/search_results.aspx?f=Technology&Page=6">last »</a>
</div>
HTML
doc.xpath("//a[not(@href = preceding-sibling::a/@href)]/@href").map(&:to_s)
# => ["/search_results.aspx?f=Technology&Page=1",
# "/search_results.aspx?f=Technology&Page=2",
# "/search_results.aspx?f=Technology&Page=3",
# "/search_results.aspx?f=Technology&Page=4",
# "/search_results.aspx?f=Technology&Page=5",
# "/search_results.aspx?f=Technology&Page=6"]
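The predicate not(@href = preceding-sibling::a/@href) keeps a link only when no earlier sibling link carries the same href, so the first occurrence of each value survives. The same rule can be sketched in plain Ruby with a Set; first_occurrences is a hypothetical helper name, not a Nokogiri method, and the input is assumed to be the href strings in document order.

```ruby
require 'set'

# Hypothetical helper (not part of Nokogiri): keep each value the first time
# it appears, which mirrors the "no preceding sibling has this href" rule.
def first_occurrences(hrefs)
  seen = Set.new
  # Set#add? returns nil when the value is already present, so select
  # drops every repeat.
  hrefs.select { |href| seen.add?(href) }
end

base = '/search_results.aspx?f=Technology&Page='
# Page numbers in document order: 1..5, then "next" (2) and "last" (6).
hrefs = [1, 2, 3, 4, 5, 2, 6].map { |n| "#{base}#{n}" }

unique = first_occurrences(hrefs)
puts unique.size # 6
puts unique.last # /search_results.aspx?f=Technology&Page=6
```

The result is the same as calling hrefs.uniq.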