Scraping multiple pages of a website with rvest

Date: 2016-11-14 20:47:48

Tags: r url dataframe web-scraping rvest

I have a website I want to scrape data from, but the data spans 8 pages. I use the following to get the first page of data:

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&college_id=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=year_id'
webpage <- read_html(site)

draft_table <- html_nodes(webpage, 'table')
draft <- html_table(draft_table)[[1]]
head(draft)
draft <- draft[-1,]
names(draft) <- c("rank", "year", "league", "round", "pick", "team", "player",
                  "age", "position", "birth", "college", "yearin", "lastyear",
                  "car.gp", "car.mp", "car.ppg", "car.rebpg", "car.apg",
                  "car.stlpg", "car.blkpg", "car.fgp", "car.2pfgp", "car.3pfgp",
                  "car.ftp", "car.ws", "car.ws48")

draft <- draft[draft$player != "" & draft$player != "Player", ]

The URLs seem to advance in a regular sequence: the first page has an offset of 0, the second an offset of 100, the third an offset of 200, and so on.
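In other words, the 8 page URLs would differ only in one query parameter. A minimal sketch of that pattern (the `offset` parameter name is my assumption based on the sequence above):

```r
# Hypothetical: generate the 8 page URLs from the base URL, assuming the
# pages differ only by an `offset` query parameter (0, 100, ..., 700).
base <- paste0('http://www.basketball-reference.com/play-index/draft_finder.cgi',
               '?request=1&year_min=2001&year_max=2014&college_id=0',
               '&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y',
               '&pos_is_c=Y&pos_is_cf=Y&order_by=year_id')

# One URL per page
urls <- paste0(base, '&offset=', seq(0, 700, by = 100))
length(urls)  # 8
```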

My problem is that I can't find a simple way to scrape all 8 pages at once without manually pasting each URL into the "site" vector above. I'd like to do this generically, so if there is a general rvest solution that would be great. Otherwise, any help or suggestions would be much appreciated. Many thanks.

1 Answer:

Answer 0 (score: 3)

The function follow_link is exactly what you are looking for.

library(rvest)
library(stringr)
library(tidyr)

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&college_id=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=year_id'

s <- html_session(site)
s <- s %>% follow_link(css = '#pi p a')    # page 1 -> page 2
url2 <- s$url

s <- s %>% follow_link(css = '#pi a+ a')   # page 2 -> page 3
url3 <- s$url

The link pattern appears to repeat from the second page onward, so the remaining pages can all be reached with follow_link(css = '#pi a+ a').
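Putting that together, the navigation can be looped to collect every page's table into one data frame. This is a sketch, assuming there are exactly 8 pages and that the '#pi a+ a' selector keeps matching the next-page link on every page after the first:

```r
library(rvest)  # re-exports %>% and provides html_session/follow_link

site <- 'http://www.basketball-reference.com/play-index/draft_finder.cgi?request=1&year_min=2001&year_max=2014&college_id=0&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&order_by=year_id'
s <- html_session(site)

# Helper (my own, not part of rvest): extract the first table
# from the session's current page.
get_table <- function(s) html_table(html_nodes(s, 'table'))[[1]]

tables <- list(get_table(s))
s <- s %>% follow_link(css = '#pi p a')      # page 1 -> page 2
tables[[2]] <- get_table(s)
for (i in 3:8) {
  s <- s %>% follow_link(css = '#pi a+ a')   # pages 3..8 share one pattern
  tables[[i]] <- get_table(s)
}

# Stack all 8 pages; the header/blank-row cleanup from the question
# still applies afterwards.
draft_all <- do.call(rbind, tables)
```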