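Scraping content from multiple tables on multiple pages with the rvest package

Time: 2020-08-31 03:47:23

Tags: r web-scraping rvest

I am very new to R and the rvest package, and I am trying to extract data from multiple tables across multiple pages.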

One example is the box score for each game, such as this one:

https://www.pro-football-reference.com/boxscores/201309050den.htm

I tried the following to get the data from one table:

library(rvest)

webpage <- read_html("https://www.pro-football-reference.com/boxscores/201309050den.htm")

tbls <- html_nodes(webpage, "table")

head(tbls)


tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[3:3] %>%
  html_table(fill = TRUE)

str(tbls_ls)

This returns:

List of 1
    $ :'data.frame':      22 obs. of  22 variables:
      ..$          : chr [1:22] "Player" "Joe Flacco" "Ray Rice" "Bernard Pierce" ...
      ..$          : chr [1:22] "Tm" "BAL" "BAL" "BAL" ...
      ..$ Passing  : chr [1:22] "Cmp" "34" "0" "0" ...
      ..$ Passing  : chr [1:22] "Att" "62" "0" "0" ...
      ..$ Passing  : chr [1:22] "Yds" "362" "0" "0" ...
      ..$ Passing  : chr [1:22] "TD" "2" "0" "0" ...
      ..$ Passing  : chr [1:22] "Int" "2" "0" "0" ...
      ..$ Passing  : chr [1:22] "Sk" "4" "0" "0" ...
      ..$ Passing  : chr [1:22] "Yds" "27" "0" "0" ...
      ..$ Passing  : chr [1:22] "Lng" "34" "0" "0" ...
      ..$ Passing  : chr [1:22] "Rate" "69.4" "" "" ...
      ..$ Rushing  : chr [1:22] "Att" "0" "12" "9" ...
      ..$ Rushing  : chr [1:22] "Yds" "0" "36" "22" ...
      ..$ Rushing  : chr [1:22] "TD" "0" "1" "0" ...
      ..$ Rushing  : chr [1:22] "Lng" "0" "12" "14" ...
      ..$ Receiving: chr [1:22] "Tgt" "0" "11" "1" ...
      ..$ Receiving: chr [1:22] "Rec" "0" "8" "0" ...
      ..$ Receiving: chr [1:22] "Yds" "0" "35" "0" ...
      ..$ Receiving: chr [1:22] "TD" "0" "0" "0" ...
      ..$ Receiving: chr [1:22] "Lng" "0" "10" "0" ...
      ..$ Fumbles  : chr [1:22] "Fmb" "1" "0" "0" ...
      ..$ Fumbles  : chr [1:22] "FL" "0" "0" "0" ...

But this is only one table from one game.
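The `str()` output above suggests that the real column labels (Player, Tm, Cmp, ...) ended up in the first data row, while the header holds only the group labels (Passing, Rushing, ...). Here is a minimal sketch of one way to tidy that, assuming the scraped table has this shape. The helper name `promote_header` is hypothetical, and the small `raw` data frame is a hand-built stand-in rather than scraped data:

```r
# Hypothetical helper: merge the group header with the first data row
# to build unique column names, then drop that first row.
promote_header <- function(df) {
  group <- names(df)
  sub <- vapply(df, function(col) as.character(col[1]), character(1),
                USE.NAMES = FALSE)
  blank <- is.na(group) | group == ""
  new_names <- ifelse(blank, sub, paste(group, sub, sep = "_"))
  names(df) <- make.unique(new_names)
  out <- df[-1, , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Hand-built stand-in for the scraped table (values taken from the output above)
raw <- data.frame(
  c("Player", "Joe Flacco", "Ray Rice"),
  c("Tm", "BAL", "BAL"),
  c("Att", "62", "0"),
  c("Yds", "362", "0"),
  c("Att", "0", "12"),
  stringsAsFactors = FALSE
)
names(raw) <- c("", "", "Passing", "Passing", "Rushing")

clean <- promote_header(raw)
```

The columns are still character after this step, so numeric columns would need converting before any arithmetic.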

I am trying to go through the page for every box score, for every week of every year.

All of the pages start with this part of the URL:

https://www.pro-football-reference.com/boxscores/

But then I need to iterate over all the dates in a year, for example:

201309050
201309080

and the teams:

den
buf

(this would be all 32 teams in the NFL)

The two examples above lead to the following two URLs:

https://www.pro-football-reference.com/boxscores/201309050den.htm
https://www.pro-football-reference.com/boxscores/201309080buf.htm
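The URL pattern described above (base URL, then date string, then team code, then `.htm`) can be sketched with `expand.grid()` and `paste0()`. This is only an illustration using the two dates and two teams from the examples; the variable names are my own, and most date/team combinations will not correspond to a real game:

```r
base_url <- "https://www.pro-football-reference.com/boxscores/"
dates <- c("201309050", "201309080")
teams <- c("den", "buf")

# Every date/team combination, one row per pair
combos <- expand.grid(date = dates, team = teams, stringsAsFactors = FALSE)

# Assemble the full URL for each pair
urls <- paste0(base_url, combos$date, combos$team, ".htm")
```

With the full vectors of dates and 32 team codes, the same two lines would generate every candidate URL.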

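If I have a vector of dates and a vector of teams, is there a way to loop over them, check every date/team combination, and return the information from the tables on each page?

Or could I use a start date and an end date, and then somehow go through every date in that range with each team name?

The start date would be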

20130901

and the end date would be

20140301

(for the 2013 season). Ideally I would also like to cover the full seasons from 2010 through 2019.
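For the start/end date idea, a possible sketch is to build a daily sequence with `seq()` and format it to match the date strings in the URLs. The trailing `"0"` is an assumption based on the two example URLs, and the variable names are illustrative:

```r
start_date <- as.Date("20130901", format = "%Y%m%d")
end_date <- as.Date("20140301", format = "%Y%m%d")

# One entry per calendar day in the range, inclusive of both ends
all_days <- seq(start_date, end_date, by = "day")

# Format as YYYYMMDD and append the trailing "0" seen in the example URLs
date_ids <- paste0(format(all_days, "%Y%m%d"), "0")
```

This range covers 182 days, so combining it with 32 teams gives 5,824 candidate URLs for one season, most of which will not exist.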

Ideally, I would like to loop through every date in a year and every team and, whenever records are returned, add them all to a single table like this:

Year   Week   Player  Team    Cmp   Att   Yds   TD   Int   Sk   Yds   Lng  Rate   Att   Yds   TD   Lng   Tgt   Rec   Yds   TD   Lng   Fmb   FL
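One way the "loop, skip pages that return nothing, and stack the rest" idea might look is sketched below. The helper names `read_boxscore` and `scrape_all` are hypothetical, the table index 3 is carried over from the attempt above and may not hold for every page, and the pause between requests is an arbitrary choice:

```r
# Hypothetical helper: return the third table on a page, or NULL when the
# page cannot be read (e.g. no game for that date/team) or has fewer tables.
read_boxscore <- function(url) {
  tryCatch({
    page <- rvest::read_html(url)
    tbls <- rvest::html_nodes(page, "table")
    if (length(tbls) < 3) {
      NULL
    } else {
      tbl <- rvest::html_table(tbls[[3]], fill = TRUE)
      # Make column names unique so the tables can be stacked by name
      names(tbl) <- make.names(names(tbl), unique = TRUE)
      tbl
    }
  }, error = function(e) NULL)
}

# Apply a reader to each URL, pause between requests, and stack the results.
scrape_all <- function(urls, reader = read_boxscore, pause = 3) {
  results <- lapply(urls, function(u) {
    Sys.sleep(pause)
    tbl <- reader(u)
    if (is.null(tbl)) return(NULL)
    tbl$url <- u  # keep the source URL so date and team can be recovered
    tbl
  })
  results <- Filter(Negate(is.null), results)
  if (length(results) == 0) return(NULL)
  do.call(rbind, results)
}
```

Calling `scrape_all(urls)` on a vector of candidate URLs would return one combined data frame, or `NULL` if nothing could be read. The `url` column is a stand-in for the Year and Week columns, which would have to be derived separately.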

It would be best to return only the records for each quarterback, though I am not sure how to achieve that.
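Since the table has no position column, one rough approximation is to keep only players with at least one passing attempt. This is a sketch under that assumption; it would also keep any non-quarterback who threw a pass. The column name `Passing_Att` is illustrative, and the data frame is a hand-built stand-in using values from the output above:

```r
# Hand-built stand-in for a cleaned player table; column names are assumptions
players <- data.frame(
  Player = c("Joe Flacco", "Ray Rice", "Bernard Pierce"),
  Tm = c("BAL", "BAL", "BAL"),
  Passing_Att = c("62", "0", "0"),
  stringsAsFactors = FALSE
)

# Scraped values arrive as character, so convert before comparing
players$Passing_Att <- as.numeric(players$Passing_Att)

# Keep rows with at least one passing attempt
passers <- players[!is.na(players$Passing_Att) & players$Passing_Att > 0, ]
```

A stricter filter would need position data from another source.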

0 answers:

No answers yet