Accessing the most recent table row and its data

Time: 2014-11-24 20:42:15

Tags: ruby web-scraping nokogiri

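I'm building a small app that acts on the result of the most recent game, that is, on the last row of game data (win/loss and game number).

My problem is accessing the first column of the last row (the most recently played game). How can this be done?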

Data source

require 'open-uri'

class BrooklynPizzaController < ApplicationController

  def index
    # URL for dynamic content
    url = "http://www.basketball-reference.com/teams/BRK/2015_games.html"

    # Open URL using nokogiri
    doc = Nokogiri::HTML(open(url))

    # Scrape result from Web site
    @result = doc.css("#teams_games").xpath("//table/tbody/tr/td[8]/text()")

    # IN PROGRESS - Get date of last game played
    @result_date = doc.xpath('//table/tbody/tr/td[2]/a/text()') do |link|
      @result_date[link.text.strip] = link['a']
    end


    ###############################################################
    # IN PROGRESS - Get number of last game played from 1st column
    # doc.xpath('//table/tbody/tr/td[1]/text()') do |game|
    #   last_game_number = 
    # end
    ################################################################

    # @result_date = doc.css("#teams_games").xpath("//table/tbody/tr/td[2]/text()")
    # Set date to current
    @date = Date.today

    # Get date of last game played
    if (@result.last.next == nil)
      flag = doc.xpath("//table/tbody/tr[#{@result}]")
      @result_date = doc.xpath("//table/tbody/tr#{flag}/td[2]/a/text()")
    end
  end
end

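Please let me know if anything is missing from the information I've given, because I feel like I've left something out.

2 answers:

Answer 0 (score: 1)

To get the row, you can do this: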

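# Collect every non-empty cell in the 8th (win/loss) column
win_loss_tds = doc.css("#teams_games tbody tr td:nth-child(8):not(:empty)")
# The last such cell belongs to the most recently played game; its parent is the <tr>
last_win_loss_row = win_loss_tds.last.parent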

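No doubt there's a way to do this in a single XPath expression, but I'll leave that as an exercise for the reader, since I don't care for XPath.

To get the game number from the first column, you can do this: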

game_num_col = last_win_loss_row.at("td:first-child")
game_num = game_num_col.text.to_i
# => 82

To get the date from the second column:

date_col = last_win_loss_row.at("td:nth-child(2)") # XPath: td[2]
date = DateTime.parse(date_col.text)
# => 2015-04-15T00:00:00+00:00
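DateTime.parse guesses the format, which is convenient but can silently accept unexpected text. If the date cell always looks like "Sat, Nov 22, 2014" (the shape shown in the other answer's output), an explicit format may be a safer choice. A small sketch, where parse_game_date is a hypothetical helper name:

```ruby
require 'date'

# Hypothetical helper: parses the date cell text with a fixed format.
# Returns nil instead of raising when the text does not match.
def parse_game_date(text)
  Date.strptime(text.strip, '%a, %b %d, %Y')
rescue ArgumentError
  nil
end

puts parse_game_date('Sat, Nov 22, 2014')   # a Date for 22 November 2014
puts parse_game_date('TBD').inspect         # nil, since the format does not match
```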

If you want both the date and the time, you can do this:

time_col = last_win_loss_row.at("td:nth-child(3)")
date_time = DateTime.parse("#{date_col.text} #{time_col.text}")
# => 2015-04-15T08:00:00-03:00

Answer 1 (score: 1)

Well, I'd do it this way:

require 'open-uri'
require 'nokogiri'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer handles URLs
doc = Nokogiri::HTML(URI.open("http://www.basketball-reference.com/teams/BRK/2015_games.html"))

latest_score_row = doc.search('//tr/td/a[contains(.,"Box Score")]/../..').last
latest_text = latest_score_row.search('td').map(&:text)
# => ["13",
#     "Sat, Nov 22, 2014",
#     "8:30p EST",
#     "",
#     "Box Score",
#     "@",
#     "San Antonio Spurs",
#     "L",
#     "",
#     "87",
#     "99",
#     "5",
#     "8",
#     "L 1",
#     ""]

But YMMV.


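How does it work? Simple. It looks for <a> nodes in the page that contain "Box Score", then, for each one found, backs up two levels to the enclosing <tr> node and returns the resulting array to Nokogiri/Ruby. Calling last then picks the final one.

After that, it's just a matter of looking in the row for its <td> nodes and grabbing their text.

Getting the timestamp is a matter of pulling the date and time out of the array, massaging the "am/pm" part a little, and letting Ruby build an object: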
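Once the cell texts are in a plain array, it can help to label them so later code doesn't depend on bare indexes. The sketch below reuses the array printed above. The column labels are guesses made for illustration only, since the page's real header names aren't shown in this answer.

```ruby
# Cell texts as returned by latest_score_row.search('td').map(&:text) above.
latest_text = ["13", "Sat, Nov 22, 2014", "8:30p EST", "", "Box Score", "@",
               "San Antonio Spurs", "L", "", "87", "99", "5", "8", "L 1", ""]

# Hypothetical labels, one per cell; only the positions come from the output.
COLUMNS = %i[game date time blank box_score location opponent result overtime
             team_points opponent_points wins losses streak notes].freeze

# Pair each label with the cell in the same position.
game = COLUMNS.zip(latest_text).to_h

puts "Game #{game[:game].to_i} vs #{game[:opponent]}: #{game[:result]}"
```

If the site adds or reorders columns, only the COLUMNS list needs to change.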

require 'time'  # provides Time.strptime

latest_time = Time.strptime(
  [
    latest_text[1],                      # => "Sat, Nov 22, 2014"
    latest_text[2].sub(/([ap])/, '\1m')  # => "8:30pm EST"
  ].join(' '),                           # => "Sat, Nov 22, 2014 8:30pm EST"
  '%a, %b %d, %Y %H:%M%P %Z'             # => "%a, %b %d, %Y %H:%M%P %Z"
)                                        # => 2014-11-22 18:30:00 -0700
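The same idea can be wrapped in a small method. This is only a sketch: parse_game_time is a hypothetical name, it uses %I (12-hour clock) rather than %H, and it only rewrites an "a" or "p" that directly follows a digit, which is a slightly stricter substitution than the one above.

```ruby
require 'time'

# Hypothetical helper: builds a Time from the date and time cell texts.
def parse_game_time(date_text, time_text)
  # "8:30p EST" -> "8:30pm EST"; only an a/p right after a digit is changed.
  normalized = time_text.sub(/(?<=\d)([ap])\b/, '\1m')
  Time.strptime("#{date_text} #{normalized}", '%a, %b %d, %Y %I:%M%P %Z')
end

game_time = parse_game_time('Sat, Nov 22, 2014', '8:30p EST')

# 8:30pm EST is 01:30 UTC on the following day.
puts game_time.getutc.iso8601
```

Comparing in UTC avoids surprises from the machine's local time zone, which is why the output above was printed with a -0700 offset.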