Here is what I am trying to do.
For example, I took the source of a Wiki page:
<h2><span class="mw-headline" id="Filmography">Filmography</span><span class="mw-editsection"> <span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Katie_Holmes&action=edit&section=10" title="Edit section: Filmography">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<table class="wikitable sortable plainrowheaders">
<caption>Film</caption>
<tr>
<th scope="col">Year</th>
<th scope="col">Title</th>
<th scope="col">Role</th>
<th scope="col" class="unsortable">Notes</th>
</tr>
<tr>
<td style="text-align:center;">1997</td>
<th scope="row"><i><span class="sortkey">Ice Storm, The</span><span class="vcard"><span class="fn"><a href="/wiki/The_Ice_Storm_(film)" title="The Ice Storm (film)">The Ice Storm</a></span> </span></i></th>
<td>Libbets Casey</td>
<td>First professional role</td>
</tr>
<tr>
<td style="text-align:center;">1998</td>
<th scope="row"><i><a href="/wiki/Disturbing_Behavior" title="Disturbing Behavior">Disturbing Behavior</a></i></th>
<td>Rachel Wagner</td>
<td><a href="/wiki/MTV_Movie_Award_for_Best_Breakthrough_Performance" title="MTV Movie Award for Best Breakthrough Performance">MTV Movie Award for Best Breakthrough Performance</a><br />
Nominated–<a href="/wiki/Saturn_Award_for_Best_Performance_by_a_Younger_Actor" title="Saturn Award for Best Performance by a Younger Actor">Saturn Award for Best Performance by a Younger Actor</a> </td>
</tr>
I want to find the <span> tag with the text 'Filmography',
and then retrieve all the film links from the table that follows it.
Is there a way to do this?
Answer 0 (score: 1)
A solution using Nokogiri's #css selection. (It may not be the most efficient way, but it works.)
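require 'open-uri'
require 'nokogiri'

# Fetch the rendered article. The action=edit URL returns the edit form, not the table.
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
page = Nokogiri::HTML(URI.open("https://en.wikipedia.org/wiki/Katie_Holmes"))

# Selectors follow the markup quoted in the question.
puts page.css('span.mw-headline#Filmography').text

page.css('table').each do |tab|
  # Only process the table whose caption is "Film".
  next unless tab.css('caption').text == "Film"

  # Film titles are the links inside the <th> row headers.
  tab.css('th a').each do |a|
    puts "Title: #{a['title']} URL:#{a['href']}"
  end
end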
#=> Filmography
#=> Title: The Ice Storm (film) URL:/wiki/The_Ice_Storm_(film)
#=> Title: Disturbing Behavior URL:/wiki/Disturbing_Behavior
#=> ... and so on