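I'm working on a small project in R that involves scraping some football data from a website. Here's the link to one of the years of data:
http://www.sports-reference.com/cfb/years/2007-schedule.html
As you can see, there is a "Date" column in which each date is a hyperlink. The hyperlink takes you to the stats for that particular game, which is the data I'd like to scrape. Unfortunately, a lot of games take place on the same date, which means their link text is identical. So if I scrape the hyperlinks from the table (which I have done) and then do something like the following: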
url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links = character vector with scraped date links
for (i in 1:length(links)) {
  stats = html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}
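it scrapes from the first link that matches each date. For example, 11 games took place on Aug 30, 2007, but if I pass that date to the follow_link function, it grabs data from the first game (Boise St. vs. Weber St.) every time. Is there any way I can specify that I want it to move down the table?
I have already found a workaround by figuring out the formula for the URLs that the date hyperlinks take you to, but it's a pretty convoluted process, so I thought I'd ask whether anyone knows how to do it this way.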
Answer 0 (score: 1)
Here's a complete example:
library(rvest)
library(dplyr)
library(pbapply)

# Get the main page
# (read_html() replaces html(), which newer rvest versions no longer provide)
URL <- 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
pg <- read_html(URL)

# Get the href behind each date link; unlike the link text, these differ per game
links <- html_attr(html_nodes(pg, xpath="//table/tbody/tr/td[3]/a"), "href")

# I'm only limiting to 10 since I really don't care about football
# enough to waste the bandwidth.
#
# You can just remove the [1:10] for your needs
# pblapply gives you a much-needed progress bar for free
scoring_games <- pblapply(links[1:10], function(x) {
  game_pg <- read_html(sprintf("http://www.sports-reference.com%s", x))
  scoring <- html_table(html_nodes(game_pg, xpath="//table[@id='passing']"), header=TRUE)[[1]]
  # The first data row holds the real column names
  colnames(scoring) <- scoring[1,]
  # Drop that row, plus blank and repeated-header rows
  filter(scoring[-1,], !Player %in% c("", "Player"))
})

# you can bind_rows them all together but you should
# probably add a column for the game then
bind_rows(scoring_games)
## Source: local data frame [27 x 11]
##
##             Player            School   Cmp   Att   Pct   Yds   Y/A  AY/A    TD   Int  Rate
##              (chr)             (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
##  1    Taylor Tharp       Boise State    14    19  73.7   184   9.7  10.7     1     0 172.4
##  2      Nick Lomax       Boise State     1     5  20.0     5   1.0   1.0     0     0  28.4
##  3   Ricky Cookman       Boise State     1     2  50.0     9   4.5 -18.0     0     1 -12.2
##  4        Ben Mauk        Cincinnati    18    27  66.7   244   9.0   8.9     2     1 159.6
##  5       Tony Pike        Cincinnati     6     9  66.7    57   6.3   8.6     1     0 156.5
##  6  Julian Edelman        Kent State    17    26  65.4   161   6.2   3.5     1     2 114.7
##  7      Bret Meyer        Iowa State    14    23  60.9   148   6.4   3.4     1     2 111.9
##  8      Matt Flynn   Louisiana State    12    19  63.2   128   6.7   8.8     2     0 154.5
##  9 Ryan Perrilloux   Louisiana State     2     3  66.7    21   7.0  13.7     1     0 235.5
## 10   Michael Henig Mississippi State    11    28  39.3   120   4.3  -5.4     0     6  32.4
## ..             ...               ...   ...   ...   ...   ...   ...   ...   ...   ...   ...
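The code comment above suggests adding a column for the game before binding the rows. A minimal sketch of one way to do that uses the `.id` argument of `bind_rows()`. The two small data frames and the `game_links` paths below are made-up stand-ins for the scraped tables and hrefs, not real results:

```r
library(dplyr)

# Hypothetical stand-ins for two scraped passing tables
scoring_games <- list(
  data.frame(Player = c("Player A", "Player B"), Yds = c("100", "20"),
             stringsAsFactors = FALSE),
  data.frame(Player = "Player C", Yds = "250", stringsAsFactors = FALSE)
)

# Hypothetical link paths, one per game (illustrative only)
game_links <- c("/cfb/boxscores/game-1.html", "/cfb/boxscores/game-2.html")

# Name each list element by its link; bind_rows() then copies that
# name into a new `game` column for every row of the element
names(scoring_games) <- game_links
all_games <- bind_rows(scoring_games, .id = "game")
```

With the scraped data, the equivalent step would presumably be `names(scoring_games) <- links[1:10]` before calling `bind_rows(scoring_games, .id = "game")`.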
Answer 1 (score: 0)
You're running a loop, but each iteration assigns to the same variable, so only the last result is kept. Try this:
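url = 'http://www.sports-reference.com/cfb/years/2007-schedule.html'
# links = character vector with scraped date links
# Pre-allocate a list so each iteration has its own slot
stats = vector("list", length(links))
for (i in seq_along(links)) {
  # html_table() on a node set returns a list, so store it with [[i]]
  stats[[i]] = html_session(url) %>%
    follow_link(links[i]) %>%
    html_nodes('whateverthisnodeis') %>%
    html_table()
}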
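Storing each result separately keeps all of them, but when several rows share the same date text, matching a link by its text will still land on the first match. The underlying issue can be reproduced offline. The HTML fragment and the `/cfb/boxscores/game-*.html` paths below are invented for illustration and only mimic the shape of the schedule table:

```r
library(rvest)
library(xml2)

# Made-up fragment: two games on the same date, each with its own href
page <- read_html('<table>
  <tr><td><a href="/cfb/boxscores/game-1.html">Aug 30, 2007</a></td></tr>
  <tr><td><a href="/cfb/boxscores/game-2.html">Aug 30, 2007</a></td></tr>
</table>')

anchors <- html_nodes(page, "a")

# The visible text is identical, so it cannot tell the games apart
link_text <- html_text(anchors)

# The href attributes differ, so they identify each game unambiguously
hrefs <- html_attr(anchors, "href")

# Resolve the relative paths against the site root (no request is made here)
game_urls <- url_absolute(hrefs, "http://www.sports-reference.com/")
```

Looping over `game_urls` (or over `hrefs`, as Answer 0 does) instead of the link text gives one distinct page per row of the table.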