rvest:使用相同的标签跟随不同的链接

时间:2015-09-05 13:40:35

标签: r hyperlink rvest

我在R做了一个涉及从网站上抓取一些足球数据的小项目。以下是其中一年数据的链接:

http://www.sports-reference.com/cfb/years/2007-schedule.html

正如您所看到的,有一个"日期"日期超链接的列,此超链接将您带到该特定游戏的统计数据,这是我想要抓取的数据。不幸的是,很多游戏都发生在同一个日期,这意味着它们的超链接是相同的。因此,如果我从表中删除超链接(我已经完成),然后执行以下操作:

url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
for (i in 1:length(links)) {
  stats = html_session(url) %>%
    follow_link(link[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}

它将从每个日期对应的第一个链接中删除。例如,2007年8月30日发生了11场比赛,但如果我把它放在follow_link函数中,它每次都会从第一场比赛(Boise St. Weber St.)获取数据。有什么方法可以指明我希望它能在桌子上移动吗?

我已经找到了一个解决方法,找出了日期超链接带给你的网址的公式,但这是一个非常复杂的过程,所以我想我是否知道有没有人知道怎么做就这样。

2 个答案:

答案 0 :(得分:1)

这是一个完整的例子:

library(rvest)
library(dplyr)
library(pbapply)

# Get the main page

URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- html(URL)

# Get the dates links
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")

# I'm only limiting to 10 since I rly don't care about football 
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free

scoring_games <- pblapply(links[1:10], function(x) {

  game_pg <- html(sprintf("http://www.sports-reference.com%s", x))
  scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
  colnames(scoring) <- scoring[1,]
  filter(scoring[-1,], !Player %in% c("", "Player"))

})

# you can bind_rows them all together but you should 
# probably add a column for the game then

bind_rows(scoring_games)

## Source: local data frame [27 x 11]
## 
##             Player            School   Cmp   Att   Pct   Yds   Y/A  AY/A    TD   Int  Rate
##              (chr)             (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
## 1     Taylor Tharp       Boise State    14    19  73.7   184   9.7  10.7     1     0 172.4
## 2       Nick Lomax       Boise State     1     5  20.0     5   1.0   1.0     0     0  28.4
## 3    Ricky Cookman       Boise State     1     2  50.0     9   4.5 -18.0     0     1 -12.2
## 4         Ben Mauk        Cincinnati    18    27  66.7   244   9.0   8.9     2     1 159.6
## 5        Tony Pike        Cincinnati     6     9  66.7    57   6.3   8.6     1     0 156.5
## 6   Julian Edelman        Kent State    17    26  65.4   161   6.2   3.5     1     2 114.7
## 7       Bret Meyer        Iowa State    14    23  60.9   148   6.4   3.4     1     2 111.9
## 8       Matt Flynn   Louisiana State    12    19  63.2   128   6.7   8.8     2     0 154.5
## 9  Ryan Perrilloux   Louisiana State     2     3  66.7    21   7.0  13.7     1     0 235.5
## 10   Michael Henig Mississippi State    11    28  39.3   120   4.3  -5.4     0     6  32.4
## ..             ...               ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

答案 1 :(得分:0)

你要经历一个循环,但是有时候设置为同一个变量,试试这个:

url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
links = character vector with scraped date links
for (i in 1:length(links)) {
    stats[i] = html_session(url) %>%
    follow_link(link[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()

}