I am trying to scrape data from a site (the-numbers.com) where the data is spread across many web pages. The consecutive pages follow the format below (only the first three are shown):
url0 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
url1 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/101"
url2 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/201"
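For reference, every page after the first just appends an offset that grows in steps of 100, so the URLs can be generated programmatically. A minimal sketch, assuming the pattern holds for later pages (the total page count is unknown at this point):

# Build the first few page URLs; offsets 101, 201, ... are assumed
# to continue in steps of 100
base <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
urls <- c(base, paste0(base, "/", seq(101, 901, by = 100)))
head(urls, 3)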
Scraping the first sequential URL (url0) into a df with the code below returns the correct output:
library(rvest)

# Read the first page and find the tables it contains
webpage <- read_html("https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time")
tbls <- html_nodes(webpage, "table")
head(tbls)

# Parse the first table into a data frame
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table(fill = TRUE)

df <- tbls_ls[[1]]
The output looks like this:
> head(df)
Rank Released Movie DomesticBox Office
1 1 2015 Star Wars Ep. VII: The Force Awakens $936,662,225
2 2 2009 Avatar $760,507,625
3 3 2018 Black Panther $700,059,566
How can I automatically scrape the subsequent URLs until reaching the end of the data, so that the output is one long df with everything rbind()-ed together?
Answer 0: (score 1)
This question was asked a few months shy of 3 years ago, but here is a solution anyway.
First, it is always a good idea to check whether scraping the site is allowed. In R, we can use the robotstxt package:
robotstxt::paths_allowed("https://www.the-numbers.com")
www.the-numbers.com
[1] TRUE
OK, we are good to go. Also, I want to echo what @hrbrmstr pointed out about donating (even the smallest amount) as a way of supporting the people behind this website (or any similar one).
The scraping function I define below uses R's repeat/if construct (similar to a do-while loop in other programming languages). Also, since the number of pages to scrape is unknown, the function takes a page_count argument, which defaults to Inf; leaving it at the default scrapes every page on the site. If you only want to scrape, say, 10 pages, set page_count = 10. Here is the function definition:
# Load packages ----
# p_load() installs any missing packages, then loads them
pacman::p_load(
  rvest,   # scraping
  glue,    # string interpolation
  stringr, # string helpers
  dplyr,   # bind_rows()
  cli      # progress messages
)
# Custom function ----
scrape_data <- function(url, page_count = Inf) {
  i <- 1
  data_list <- list()
  repeat {
    html <- read_html(url)
    # Parse the table on the current page
    data_list[[i]] <- html %>%
      html_element(css = "table") %>%
      html_table()
    # Label of the currently active page in the pagination bar, e.g. "1-100"
    current_page <- html %>%
      html_element(css = "div.pagination > a.active") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,")
    # Labels of all pages shown in the pagination bar
    all_displayed_pages <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,") %>%
      str_extract(pattern = "\\d+\\-\\d+")
    # Relative URLs behind the pagination links
    all_pages_urls <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_attr(name = "href")
    # The next page is the link right after the active one
    url <- glue("https://www.the-numbers.com{all_pages_urls[which(current_page == all_displayed_pages) + 1]}")
    cli_alert_success(glue("Scraped page: {i}"))
    i <- i + 1
    # Stop on the last displayed page, or once page_count pages are done
    if (
      current_page == all_displayed_pages[length(all_displayed_pages)] ||
        i - 1 == page_count
    ) {
      break
    }
  }
  # Stack all per-page tables into one data frame
  bind_rows(data_list)
}
Now let's use the function to scrape the first 5 pages of the table:
scrape_data(
url = "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time",
page_count = 5
)
√ Scraped page: 1
√ Scraped page: 2
√ Scraped page: 3
√ Scraped page: 4
√ Scraped page: 5
# A tibble: 500 x 7
Rank Year Movie Distributor `DomesticBox Of~ `InternationalB~ `WorldwideBox O~
<int> <int> <chr> <chr> <chr> <chr> <chr>
1 1 2015 Star Wars Ep. VII: The Force Awakens Walt Disney $936,662,225 $1,127,953,592 $2,064,615,817
2 2 2019 Avengers: Endgame Walt Disney $858,373,000 $1,939,427,564 $2,797,800,564
3 3 2009 Avatar 20th Cent… $760,507,625 $2,085,391,916 $2,845,899,541
4 4 2018 Black Panther Walt Disney $700,059,566 $636,434,755 $1,336,494,321
5 5 2018 Avengers: Infinity War Walt Disney $678,815,482 $1,365,725,041 $2,044,540,523
6 6 1997 Titanic Paramount… $659,363,944 $1,548,622,601 $2,207,986,545
7 7 2015 Jurassic World Universal $652,306,625 $1,017,673,342 $1,669,979,967
8 8 2012 The Avengers Walt Disney $623,357,910 $891,742,301 $1,515,100,211
9 9 2017 Star Wars Ep. VIII: The Last Jedi Walt Disney $620,181,382 $711,453,759 $1,331,635,141
10 10 2018 Incredibles 2 Walt Disney $608,581,744 $634,223,615 $1,242,805,359
# ... with 490 more rows
One possible improvement to the function would be to add some idle time with Sys.sleep(3) (a 3-second pause), in case the server kicks you out for hitting the site too many times too quickly; see the sketch below.
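As a minimal sketch of that change (reusing the loop body of scrape_data() above), the pause could go right after the progress message, so each iteration waits before requesting the next page:

# Inside the repeat loop of scrape_data():
cli_alert_success(glue("Scraped page: {i}"))
Sys.sleep(3) # wait 3 seconds before requesting the next page
i <- i + 1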