In R

Time: 2019-01-06 16:29:06

Tags: r

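I am trying to pull a large amount of data from Sports Reference. My coding background is fairly weak, as I am only self-taught in a few areas. I have figured out how to use the htmltab() function to pull data from SR, and I can create a table from each page on the site.

My problem is combining the tables at the end. I know the code below only uses 5 pages, which would be easy to combine with rbind(), but this is just a small test sample.

I will eventually have thousands of tables to combine, so combining them by hand at the end is impractical. Is there a way to append each new table to a combined table at each step of the loop (or to bind them all at the end without typing out thousands of table names)?

Alternatively, it seems it would be more efficient to pull all the data into a single table without first creating a thousand separate tables, but I don't know how to do that (obviously).

Thanks for your help!

(For those unfamiliar with SR, the site groups its tables into sets of 100 rows, hence multiplying i by 100 and pasting the result onto the first part of the URL.)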

for (i in 1:5) {
     a <- i*100
     url <- paste("https://www.sports-reference.com/cfb/play-index/pgl_finder.cgi?request=1&match=game&year_min=&year_max=&conf_id=&school_id=&opp_id=&game_type=&game_num_min=&game_num_max=&game_location=&game_result=&class=&c1stat=rush_att&c1comp=gt&c1val=0&c2stat=rec&c2comp=gt&c2val=0&c3stat=punt_ret&c3comp=gt&c3val=0&c4stat=kick_ret&c4comp=gt&c4val=0&order_by=date_game&order_by_asc=&offset=",a,sep = "")
     nam <- paste("ploop",i,sep = "")
     assign(nam,htmltab(url))
     # ?????? (missing step: append this table to a combined table)
     }

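2 Answers:

Answer 0: (score: 1)

In cases like this, it is usually best to store the results in a list rather than messing with assign. Here, we store the result of each loop iteration in a list, then use do.call together with rbind to create a single data frame: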

# Note: this clears every object in the current workspace; skip it if you have work to keep
rm(list = ls())
library(htmltab)

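# Collect each page's table in a list rather than in separately named variables
tables <- list()
for (i in 1:5) {
  a <- i * 100  # row offset for page i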
  url <- paste("https://www.sports-reference.com/cfb/play-index/pgl_finder.cgi?request=1&match=game&year_min=&year_max=&conf_id=&school_id=&opp_id=&game_type=&game_num_min=&game_num_max=&game_location=&game_result=&class=&c1stat=rush_att&c1comp=gt&c1val=0&c2stat=rec&c2comp=gt&c2val=0&c3stat=punt_ret&c3comp=gt&c3val=0&c4stat=kick_ret&c4comp=gt&c4val=0&order_by=date_game&order_by_asc=&offset=",a,sep = "")
  tables[[i]] <- htmltab(url)  # store this page's table in slot i
}

# Bind all pages into a single data frame in one step
table.final <- do.call(rbind, tables)

str(table.final)

'data.frame':   520 obs. of  20 variables:
 $ Rk              : chr  "101" "102" "103" "104" ...
 $ Player          : chr  "Myles Gaskin" "Willie Gay" "Jake Gervase" "Kyle Gibson" ...
 $ Date            : chr  "2019-01-01" "2019-01-01" "2019-01-01" "2019-01-01" ...
 $ G#              : chr  "14" "13" "13" "13" ...
 $ School          : chr  "Washington" "Mississippi State" "Iowa" "Central Florida" ...
 $ V2              : chr  "N" "N" "N" "N" ...
 $ Opponent        : chr  "Ohio State" "Iowa" "Mississippi State" "Louisiana State" ...
 $ V2.1            : chr  "L" "L" "W" "L" ...
 $ Rushing >> Att  : chr  "24" "0" "0" "0" ...
 $ Rushing >> Yds  : chr  "121" "0" "0" "0" ...
 $ Rushing >> TD   : chr  "2" "0" "0" "0" ...
 $ Receiving >> Rec: chr  "3" "0" "0" "0" ...
 $ Receiving >> Yds: chr  "-1" "0" "0" "0" ...
 $ Receiving >> TD : chr  "0" "0" "0" "0" ...
 $ Kick Ret >> Ret : chr  "0" "0" "0" "0" ...
 $ Kick Ret >> Yds : chr  "0" "0" "0" "0" ...
 $ Kick Ret >> TD  : chr  "0" "0" "0" "0" ...
 $ Punt Ret >> Ret : chr  "0" "0" "0" "0" ...
 $ Punt Ret >> Yds : chr  "0" "0" "0" "0" ...
 $ Punt Ret >> TD  : chr  "0" "0" "0" "0" ...
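
To see the list-then-bind pattern in isolation, here is a minimal offline sketch. Note that fake_page() is a hypothetical stand-in for htmltab(url) (it is not part of any package), and its column values are made up for illustration, so the example runs without network access.

```r
# Hypothetical stand-in for htmltab(url): returns a small data frame
# for a given offset so the example runs offline.
fake_page <- function(offset) {
  data.frame(
    Rk = as.character(offset + 1:3),
    Player = paste("Player", offset + 1:3),
    stringsAsFactors = FALSE
  )
}

# Preallocate the list, fill one slot per page, then bind once at the end.
n_pages <- 5
tables <- vector("list", n_pages)
for (i in seq_len(n_pages)) {
  tables[[i]] <- fake_page(i * 100)
}
combined <- do.call(rbind, tables)

# The same result without an explicit loop.
combined2 <- do.call(rbind, lapply(seq_len(n_pages) * 100, fake_page))

nrow(combined)
```

Binding once at the end avoids repeatedly copying a growing data frame, which is what calling rbind() inside the loop on every iteration would do.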

Answer 1: (score: 1)

You can also try a tidyverse approach:

url <- "https://www.sports-reference.com/cfb/play-index/pgl_finder.cgi?request=1&match=game&year_min=&year_max=&conf_id=&school_id=&opp_id=&game_type=&game_num_min=&game_num_max=&game_location=&game_result=&class=&c1stat=rush_att&c1comp=gt&c1val=0&c2stat=rec&c2comp=gt&c2val=0&c3stat=punt_ret&c3comp=gt&c3val=0&c4stat=kick_ret&c4comp=gt&c4val=0&order_by=date_game&order_by_asc=&offset="

# Fetch each page and row-bind the results into one data frame
df <- purrr::map_dfr(1:5, ~ htmltab::htmltab(paste0(url, .x * 100)))
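
The one-liner above can be tried offline with a rough sketch like the following. Both fake_htmltab() and base_url are assumptions made up for illustration: the first is a hypothetical stand-in for htmltab::htmltab(), the second a placeholder rather than the real query URL. The sketch assumes purrr 1.0.0 or later is installed, and map_dfr() additionally needs dplyr.

```r
library(purrr)

# Hypothetical stand-in for htmltab::htmltab(): builds a tiny data frame
# from the offset at the end of the URL so the example runs offline.
fake_htmltab <- function(url) {
  offset <- as.integer(sub(".*offset=", "", url))
  data.frame(Rk = as.character(offset + 1:2), stringsAsFactors = FALSE)
}

# Placeholder base URL (an assumption for illustration, not the real query).
base_url <- "https://example.com/finder?offset="

# map_dfr() calls the function once per element and row-binds the results.
df <- map_dfr(1:5, ~ fake_htmltab(paste0(base_url, .x * 100)))

# Same idea with map() + list_rbind(), which supersede map_dfr()
# as of purrr 1.0.0.
df2 <- list_rbind(map(1:5, ~ fake_htmltab(paste0(base_url, .x * 100))))

nrow(df)
```

Either form scales to thousands of pages by changing 1:5, since no per-page variable names are ever created.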