Question

I'm trying to scrape some NBA data. Here is what I have so far, and it works fine:

install.packages('rvest')
library(rvest)

url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)

away = data[[1]]
home = data[[3]]

colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]

away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]

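The problem is that these tables don't include the team names, which I need. To get that information, I thought I would scrape the four factors table on the page; however, rvest doesn't seem to recognize it as a table. The div that contains the four factors table is: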

<div class="overthrow table_container" id="div_four_factors">

And the table is:

<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">

This made me think I could access the table with something along the lines of:

table = html_nodes(webpage,'#div_four_factors')

But this doesn't seem to work, because all I get back is an empty list. How can I access the four factors table?

Answer 1

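I'm by no means an HTML expert, but it looks like the table you're interested in is commented out in the page source, and the comment is then overridden at some point before rendering. That would explain the empty list: rvest only sees the raw source, where the table is still inside a comment.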
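If the table really is sitting inside an HTML comment, one possible way to reach it is to pull the comment nodes out with XPath and parse their text as HTML in a second pass. The sketch below is untested against the live page: it uses a small made-up HTML string (the `all_four_factors` wrapper id, the `AAA`/`BBB` team codes and the `Pace` values are placeholders, not real data) so that it runs offline.

```r
library(rvest)

# Stand-in for the page source: the table lives inside an HTML comment.
# The markup and values here are placeholders, not the real page.
html <- paste0(
  '<div id="all_four_factors"><!-- ',
  '<div class="overthrow table_container" id="div_four_factors">',
  '<table id="four_factors">',
  '<tr><th>Team</th><th>Pace</th></tr>',
  '<tr><td>AAA</td><td>100.0</td></tr>',
  '<tr><td>BBB</td><td>100.0</td></tr>',
  '</table></div> --></div>'
)
page <- read_html(html)

# A CSS selector finds nothing, because comments are not element nodes.
length(html_nodes(page, "#four_factors"))

# Collect the text of every comment node and keep the one holding the table.
comment_text <- html_text(html_nodes(page, xpath = "//comment()"))
target <- comment_text[grepl('id="four_factors"', comment_text, fixed = TRUE)]

# Parse the comment's contents as HTML and read the table out of it.
four_factors <- html_table(html_node(read_html(target[1]), "#four_factors"))
four_factors
```

On the real page you would replace `read_html(html)` with the `webpage` object from the question. The real table also has an extra header row (`over_header`), so its column names may need the same clean-up the question applies to the box score tables.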

If we assume the home team is always listed second, we can rely on position and scrape a different element on the page:

table = html_nodes(webpage,'#bottom_nav_container')
teams <- html_text(table[1]) %>%
  stringr::str_split("Schedule\n")

away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])

Obviously not the cleanest solution, but such is life in the world of web scraping.
