I'm trying to scrape the names of all players from
http://en.wikipedia.org/wiki/List_of_current_NBA_team_rosters
Here is my beginner code:
class AllPlayersScraper
  attr_accessor :players, :names, :links

  def initialize(url)
    @players = Nokogiri::HTML(open(url))
  end

  def get_names
    @names = @players.css('table[class^="sortable"]')
    # @names = @players.css("div.span2 a").href
  end
end
require_relative './config/environment.rb'
rawfeed = "http://en.wikipedia.org/wiki/List_of_current_NBA_team_rosters"
scraper = AllPlayersScraper.new(rawfeed)
nbalist = scraper.get_names
This is the chunk of HTML I'm having trouble with. I'm not sure how to drill down to the third <td>, which is the one I need.
<table class="sortable jquery-tablesorter" style=....>
<thead>
// bunch of html...
</thead>
<tbody>
<tr>
<td style="text-align:center;"><span style="display:none" class="sortkey">5.5 !</span><span class="sorttext"><a href="/wiki/Forward-center" title="Forward-center">F/C</a></span></td>
<td style="text-align:center;">50</td>
<td style="text-align:left;"><a href="/wiki/Lavoy_Allen" title="Lavoy Allen">Allen, Lavoy</a></td>
<td><span style="display:none" class="sortkey">81 !</span><span class="sorttext">6 ft 9 in</span> (2.06 m)</td>
<td>255 lb (116 kg)</td>
<td style="text-align:center;">1989–02–04</td>
<td><a href="/wiki/Temple_University" title="Temple University">Temple</a></td>
</tr>
Thanks!

Answer 0 (score: 3)

I haven't used Nokogiri in a long time, but this works:
require 'nokogiri'
require 'open-uri'

rawfeed = "http://en.wikipedia.org/wiki/List_of_current_NBA_team_rosters"
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
@page = Nokogiri::HTML(URI.open(rawfeed))
@all_teams = @page.css('table.toccolours')
@parsed_teams = []
@all_teams.each do |t|
  team = {}
  # team name
  team["name"] = t.css('tr')[0].css('b').text.gsub(" roster", "")
  team_players_rows = t.css('table.sortable tr')
  team["players"] = []
  # Skip header and iterate over players
  team_players_rows.drop(1).each do |tp|
    team["players"].push(tp.css('td')[2].css('a').text)
  end
  @parsed_teams << team
end
@parsed_teams will then be an array with values like:
[{"name"=>"Boston Celtics",
  "players"=>["Bass, Brandon", "Bogans, Keith", "Bradley, Avery",
    "Brooks, MarShon", "Crawford, Jordan", "Faverani, Vítor", "Green, Jeff",
    "Humphries, Kris", "Lee, Courtney", "Olynyk, Kelly", "Pressey, Phil",
    "Rondo, Rajon", "Sullinger, Jared", "Wallace, Gerald"]},
 {"name"=>"Brooklyn Nets",...]