Parsing links from a table with Nokogiri

Time: 2014-11-11 12:30:43

Tags: ruby nokogiri

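Here is what I am trying to do:

  1. Find a span tag with a specific inner text.
  2. Get the table that follows this span tag.
  3. Retrieve all href links in this table.

For example, I took the source of a Wiki page: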

    <h2><span class="mw-headline" id="Filmography">Filmography</span><span class="mw-editsection">           <span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Katie_Holmes&amp;action=edit&amp;section=10" title="Edit section: Filmography">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
    <table class="wikitable sortable plainrowheaders">
    <caption>Film</caption>
    <tr>
     <th scope="col">Year</th>
     <th scope="col">Title</th>
     <th scope="col">Role</th>
     <th scope="col" class="unsortable">Notes</th>
    </tr>
    <tr>
     <td style="text-align:center;">1997</td>
     <th scope="row"><i><span class="sortkey">Ice Storm, The</span><span class="vcard"><span class="fn"><a href="/wiki/The_Ice_Storm_(film)" title="The Ice Storm (film)">The Ice Storm</a></span> </span></i></th>
     <td>Libbets Casey</td>
     <td>First professional role</td>
    </tr>
    <tr>
     <td style="text-align:center;">1998</td>
     <th scope="row"><i><a href="/wiki/Disturbing_Behavior" title="Disturbing Behavior">Disturbing Behavior</a></i></th>
     <td>Rachel Wagner</td>
     <td><a href="/wiki/MTV_Movie_Award_for_Best_Breakthrough_Performance" title="MTV Movie Award for Best Breakthrough Performance">MTV Movie Award for Best Breakthrough Performance</a><br />
     Nominated–<a href="/wiki/Saturn_Award_for_Best_Performance_by_a_Younger_Actor" title="Saturn Award for Best Performance by a Younger Actor">Saturn Award for Best Performance by a Younger Actor</a>     </td>
    </tr>
    

I want to find the <span> tag with the text 'Filmography' and then retrieve all the movie links from the table that follows it.

Can I do this?

1 answer:

Answer 0: (score: 1)

A solution using Nokogiri#css selectors. (It may not be the most efficient way, but it works.)

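    require 'open-uri'
    require 'nokogiri'

    # Fetch the rendered article (not the section-edit form) and parse it.
    # URI.open is used because a bare open() no longer accepts URLs in Ruby 3.
    page = Nokogiri::HTML(URI.open("https://en.wikipedia.org/w/index.php?title=Katie_Holmes"))

    # Print the heading text to confirm that the section is present.
    puts page.css('span.mw-headline#Filmography').text

    # Find the table captioned "Film" and print every link in its header cells.
    page.css('table').each do |tab|
      if tab.css('caption').text == "Film"
        tab.css('th').css('a').each do |a|
          puts "Title: #{a['title']} URL:#{a['href']}"
        end
      end
    end

    #=> Filmography
    #=> Title: The Ice Storm (film) URL:/wiki/The_Ice_Storm_(film)
    #=> Title: Disturbing Behavior URL:/wiki/Disturbing_Behavior
    #=> ... and so on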