How to get links using Mechanize and Nokogiri in Ruby

Asked: 2015-04-17 18:27:19

Tags: ruby web-scraping nokogiri mechanize

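Given the example below, can anyone tell me how to use Nokogiri and Mechanize to get all the links under each <h4> tag, in separate groups, i.e. all the links under:

  1. "Some text"
  2. "Some more text"
  3. "Some additional text"

    <div id="right_holder">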
        <h3><a href="#"><img src="http://example.com" width="11" height="11"></a></h3>
        <br />
        <br />
        <h4><a href="#">Some text</a></h4>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <br />
        <br />
        <h4><a href="#">Some more text</a></h4>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <br />
        <br />
        <h4><a href="#">Some additional text</a></h4>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
        <a href="#" alt="name of item"><img src="http://some.image.com" class="class1"></a>
    </div>
    

2 Answers:

Answer 0 (score: 2)

In general you would do this:

page.search('h4 a').each do |a|
  puts a[:href]
end

But I'm sure you've already noticed that none of those links actually point anywhere (every href in the example is "#").

Update:

To group them, you can use some node-set math:

page.search('h4').each do |h4|
  puts h4.text
  # every <a> sibling after this h4, minus those that also come after a later h4
  (h4.search('~ a') - h4.search('~ h4 ~ a')).each do |a|
    puts a.text
  end
end

That is, every a that follows the h4 and does not also follow another, later h4.

Answer 1 (score: 1)

You can walk through the document and separate the data, as in "How to split a HTML document using Nokogiri?", but if you know what the tag will be, you can simply split on it:

require 'nokogiri'

# html is the raw HTML string; the first chunk is whatever precedes the first <h4>
html.split('<h4').map { |g| Nokogiri::HTML::DocumentFragment.parse(g).css('a') }

page = Nokogiri::HTML(html).css("#right_holder")
links = page.children.inject([]) do |link_hash, child|
  if child.name == 'h4'
    name = child.text
    link_hash << { :name => name, :content => ""}
  end

  next link_hash if link_hash.empty?
  link_hash.last[:content] << child.to_xhtml
  link_hash
end

grouped_hsh = links.inject({}) do |hsh, link|
  hsh[link[:name]] = Nokogiri::HTML::DocumentFragment.parse(link[:content]).css('a')
  hsh
end

# {"Some text"=>[#<Nokogiri::XML::Element:0x3ff4860d6c30,
#  "Some more text"=>[#<Nokogiri::XML::Element:0x3ff486096c20...,
#  "Some additional text"=>[#<Nokogiri::XML::Element:0x3ff486f2de78...}