Scraping a table across multiple pages with rvest

Time: 2020-12-28 18:12:42

Tags: r rvest

I am trying to scrape a table from a website. I managed to write minimal code that gets the data out of the table. See the code below:

    library(rvest)
    library(dplyr)

    start_date <- "1947-01-01"
    end_date <- "2020-12-28"
    css_selector <- ".datatable"

    url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=", start_date,"&EndDate=", end_date, "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=0")
    webpage <- xml2::read_html(url)

    data <- webpage %>%
      rvest::html_node(css = css_selector) %>%
      rvest::html_table() %>%
      as_tibble()

    # the first row holds the column names: promote it to the header, then drop it
    colnames(data) <- as.character(unlist(data[1, ]))
    data <- data[-1, ]


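However, the table is split across multiple pages, and each page shows only 25 rows.

I looked at this solution before, but the difference is that for the table I am working with, the link changes by starting row number rather than by page number.

Any ideas on how to solve this would be much appreciated.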

1 answer:

Answer 0 (score: 1)

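You can step through the results page by page using the last parameter in the URL, &start=. The search results render 25 items per page, so the sequence of start values is 0, 25, 50, 75, 100...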
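The relationship between a page number and its `&start=` value can be sketched as follows. The helper names `page_to_start` and `build_url` are hypothetical and are not part of the original answer; the 25 rows per page and the URL come from the question.

```r
# Hypothetical helpers for illustration; base R only.
rows_per_page <- 25

# Convert a 1-based page number to the value of the &start= parameter.
page_to_start <- function(page, per_page = rows_per_page) {
  (page - 1) * per_page
}

# Build the search URL for a given start offset.
build_url <- function(start, start_date = "1947-01-01", end_date = "2020-12-28") {
  paste0(
    "https://www.prosportstransactions.com/basketball/Search/SearchResults.php",
    "?Player=&Team=&BeginDate=", start_date,
    "&EndDate=", end_date,
    "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=", start
  )
}

page_to_start(1:5)            # 0 25 50 75 100
build_url(page_to_start(2))   # URL ending in &start=25
```

Keeping the offset arithmetic in one small function makes it harder to be off by one page when the first page starts at 0.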

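We will fetch the first 5 pages of results, 125 transactions in total. Since the first page begins at &start=0, we assign a vector, startRows, that holds the starting row of each page.

We then use the vector to drive lapply() with an anonymous function that reads each page and removes the header row from the data it returns.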

library(rvest)
library(dplyr)
start_date <- "1947-01-01"
end_date <- "2020-12-28"
css_selector <- ".datatable"
# starting row of each of the first 5 pages
startRows <- c(0,25,50,75,100)
pages <- lapply(startRows,function(x){
     # build the URL for the page that starts at row x
     url <- paste0("https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=&BeginDate=", start_date,"&EndDate=", end_date, 
                   "&ILChkBx=yes&InjuriesChkBx=yes&PersonalChkBx=yes&Submit=Search&start=",x)
     webpage <- xml2::read_html(url)
     data <- webpage %>%
          rvest::html_node(css = css_selector) %>%
          rvest::html_table() %>% 
          as_tibble()
     # the first row holds the column names: promote it to the header, then drop it
     colnames(data) <- as.character(unlist(data[1, ]))
     data[-1, ]
})
# stack the per-page tibbles into a single tibble
data <- do.call(rbind,pages)
head(data,n=10)

...and the output:

> head(data,n=10)
# A tibble: 10 x 5
   Date     Team      Acquired            Relinquished          Notes                          
   <chr>    <chr>     <chr>               <chr>                 <chr>                          
 1 1947-08… Bombers … ""                  "• Jack Underman"     fractured legs (in auto accide…
 2 1948-02… Bullets … "• Harry Jeannette… ""                    broken rib (DTD) (date approxi…
 3 1949-03… Capitols  ""                  "• Horace McKinney /… personal reasons (DTD)         
 4 1949-11… Capitols  ""                  "• Fred Scolari"      fractured right cheekbone (out…
 5 1949-12… Knicks    ""                  "• Vince Boryla"      mumps (out ~2 weeks)           
 6 1950-01… Knicks    "• Vince Boryla"    ""                    returned to lineup (date appro…
 7 1950-10… Knicks    ""                  "• Goebel Ritter / T… bruised ligaments in left ankl…
 8 1950-11… Warriors  ""                  "• Andy Phillip"      lacerated foot (DTD)           
 9 1950-12… Celtics   ""                  "• Andy Duncan (a)"   fractured kneecap (out indefin…
10 1951-12… Bullets   ""                  "• Don Barksdale"     placed on IL                   
> 

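Verifying the results

We can verify the results by printing the last row of each page and the first row of the next one, starting with the last observation on page 1.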

data[c(25,26,50,51,75,76,100,101,125),]

...and the output, which matches what is rendered on pages 1 - 5 of the search results when navigating the website manually.

> data[c(25,26,50,51,75,76,100,101,125),]
# A tibble: 9 x 5
  Date      Team        Acquired      Relinquished    Notes                                    
  <chr>     <chr>       <chr>         <chr>           <chr>                                    
1 1960-01-… Celtics     ""            "• Bill Sharma… sprained Achilles tendon (date approxima…
2 1960-01-… Celtics     ""            "• Jim Loscuto… sore back and legs (out indefinitely) (d…
3 1964-10-… Knicks      "• Art Heyma… ""              returned to lineup                       
4 1964-12-… Hawks       "• Bob Petti… ""              returned to lineup (date approximate)    
5 1968-11-… Nets (ABA)  ""            "• Levern Tart" fractured right cheekbone (out indefinit…
6 1968-12-… Pipers (AB… ""            "• Jim Harding" took leave of absence as head coach for …
7 1970-08-… Lakers      ""            "• Earnie Kill… dislocated left foot (out indefinitely)  
8 1970-10-… Lakers      ""            "• Elgin Baylo… torn Achilles tendon (out for season) (d…
9 1972-01-… Cavaliers   "• Austin Ca… ""              returned to lineup                       
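The row indices used for this check do not have to be typed by hand. As a small sketch, assuming 25 rows per page and 5 scraped pages as in the example above (the variable names are illustrative only), they can be derived like this:

```r
# Assumes 25 rows per page and 5 scraped pages, as in the example above.
rows_per_page <- 25
n_pages <- 5

# last row of every page
last_rows <- rows_per_page * seq_len(n_pages)
# first row of pages 2 to n_pages
first_rows <- last_rows[-n_pages] + 1

# interleave "last row of page k" with "first row of page k + 1",
# then append the last row of the final page
check_rows <- c(rbind(last_rows[-n_pages], first_rows), last_rows[n_pages])
check_rows   # 25 26 50 51 75 76 100 101 125
```

With these indices, `data[check_rows, ]` selects the same rows as the hand-written vector, and the check keeps working if `n_pages` changes.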

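If we look at the last page of the table, we find that the maximum value in the page series is &start=61475. The R code to generate the entire sequence of pages (2460 of them, which matches the number of pages listed in the website's search results) is: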

# generate entire sequence of pages
pages <- c(0,seq(from=25,to=61475,by=25))

...and the output:

> head(pages)
[1]   0  25  50  75 100 125
> tail(pages)
[1] 61350 61375 61400 61425 61450 61475
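To run the scrape over the whole sequence, a pause between requests and some error handling are probably worth adding. The following is only a sketch under stated assumptions: `scrape_pages`, `fetch_page`, and `fake_fetch` are hypothetical names, the one-second default delay is an arbitrary choice rather than a documented limit of the site, and `fetch_page` stands for any function that takes a start offset and returns a data frame (for instance the anonymous function passed to lapply() in the answer).

```r
# A sketch only; scrape_pages and its arguments are hypothetical names.
scrape_pages <- function(starts, fetch_page, delay = 1) {
  results <- lapply(starts, function(s) {
    Sys.sleep(delay)  # pause between requests (the delay value is an assumption)
    tryCatch(
      fetch_page(s),
      error = function(e) {
        message("start=", s, " failed: ", conditionMessage(e))
        NULL  # mark the page as failed
      }
    )
  })
  # drop failed pages, then stack the rest
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# The full sequence of start values, equivalent to c(0, seq(25, 61475, by = 25)).
all_starts <- seq(from = 0, to = 61475, by = 25)
length(all_starts)   # 2460

# Offline demonstration with a stand-in fetcher (no network access).
fake_fetch <- function(s) {
  if (s == 50) stop("simulated failure")
  data.frame(start = s, row = 1:2)
}
demo <- scrape_pages(c(0, 25, 50, 75), fake_fetch, delay = 0)
nrow(demo)           # 6: three successful pages of two rows each
```

Passing the fetcher in as an argument keeps the looping logic testable without touching the website. Failed pages are reported through message() so they can be retried later.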