I am trying to scrape data from a site (the-numbers.com) where the data is spread across many web pages. The consecutive pages follow the format below (only the first three are shown):
url0 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
url1 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/101"
url2 <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time/201"
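For reference, every page after the first just appends an offset that grows in steps of 100, so the URLs can be generated programmatically. A minimal sketch, assuming the pattern holds for later pages (the total page count is unknown at this point):

# Build the first few page URLs; offsets 101, 201, ... are assumed
# to continue in steps of 100
base <- "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time"
urls <- c(base, paste0(base, "/", seq(101, 901, by = 100)))
head(urls, 3)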
Scraping the first sequential URL (url0) into a df with the code below returns the correct output:
library(rvest)

# Read the first page and find the tables it contains
webpage <- read_html("https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time")
tbls <- html_nodes(webpage, "table")
head(tbls)

# Parse the first table into a data frame
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table(fill = TRUE)

df <- tbls_ls[[1]]
The output looks like this:
> head(df)
Rank Released Movie DomesticBox Office
1 1 2015 Star Wars Ep. VII: The Force Awakens $936,662,225
2 2 2009 Avatar $760,507,625
3 3 2018 Black Panther $700,059,566
How can I automatically scrape the subsequent URLs until reaching the end of the data, so that the output is one long df with everything rbind()-ed together?
Answer 0: (score 1)
This question was asked a few months shy of 3 years ago, but here is a solution anyway.
First, it is always a good idea to check whether scraping the site is allowed. In R, we can use the robotstxt package:
robotstxt::paths_allowed("https://www.the-numbers.com")
www.the-numbers.com
[1] TRUE
OK, we are good to go. Also, I want to echo what @hrbrmstr pointed out about donating (even the smallest amount) as a way of supporting the people behind this website (or any similar one).
The scraping function I define below uses R's repeat/if construct (similar to a do-while loop in other programming languages). Also, since the number of pages to scrape is unknown, the function takes a page_count argument, which defaults to Inf; leaving it at the default scrapes every page on the site. If you only want to scrape, say, 10 pages, set page_count = 10. Here is the function definition:
# Load packages ----
# p_load() installs any missing packages, then loads them
pacman::p_load(
  rvest,   # scraping
  glue,    # string interpolation
  stringr, # string helpers
  dplyr,   # bind_rows()
  cli      # progress messages
)
# Custom function ----
scrape_data <- function(url, page_count = Inf) {
  i <- 1
  data_list <- list()
  repeat {
    html <- read_html(url)
    # Parse the table on the current page
    data_list[[i]] <- html %>%
      html_element(css = "table") %>%
      html_table()
    # Label of the currently active page in the pagination bar, e.g. "1-100"
    current_page <- html %>%
      html_element(css = "div.pagination > a.active") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,")
    # Labels of all pages shown in the pagination bar
    all_displayed_pages <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_text() %>%
      str_remove_all(pattern = "\\,") %>%
      str_extract(pattern = "\\d+\\-\\d+")
    # Relative URLs behind the pagination links
    all_pages_urls <- html %>%
      html_elements(css = "div.pagination > a") %>%
      html_attr(name = "href")
    # The next page is the link right after the active one
    url <- glue("https://www.the-numbers.com{all_pages_urls[which(current_page == all_displayed_pages) + 1]}")
    cli_alert_success(glue("Scraped page: {i}"))
    i <- i + 1
    # Stop on the last displayed page, or once page_count pages are done
    if (
      current_page == all_displayed_pages[length(all_displayed_pages)] ||
        i - 1 == page_count
    ) {
      break
    }
  }
  # Stack all per-page tables into one data frame
  bind_rows(data_list)
}
Now let's use the function to scrape the first 5 pages of the table:
scrape_data(
url = "https://www.the-numbers.com/box-office-records/domestic/all-movies/cumulative/all-time",
page_count = 5
)
√ Scraped page: 1
√ Scraped page: 2
√ Scraped page: 3
√ Scraped page: 4
√ Scraped page: 5
# A tibble: 500 x 7
Rank Year Movie Distributor `DomesticBox Of~ `InternationalB~ `WorldwideBox O~
<int> <int> <chr> <chr> <chr> <chr> <chr>
1 1 2015 Star Wars Ep. VII: The Force Awakens Walt Disney $936,662,225 $1,127,953,592 $2,064,615,817
2 2 2019 Avengers: Endgame Walt Disney $858,373,000 $1,939,427,564 $2,797,800,564
3 3 2009 Avatar 20th Cent… $760,507,625 $2,085,391,916 $2,845,899,541
4 4 2018 Black Panther Walt Disney $700,059,566 $636,434,755 $1,336,494,321
5 5 2018 Avengers: Infinity War Walt Disney $678,815,482 $1,365,725,041 $2,044,540,523
6 6 1997 Titanic Paramount… $659,363,944 $1,548,622,601 $2,207,986,545
7 7 2015 Jurassic World Universal $652,306,625 $1,017,673,342 $1,669,979,967
8 8 2012 The Avengers Walt Disney $623,357,910 $891,742,301 $1,515,100,211
9 9 2017 Star Wars Ep. VIII: The Last Jedi Walt Disney $620,181,382 $711,453,759 $1,331,635,141
10 10 2018 Incredibles 2 Walt Disney $608,581,744 $634,223,615 $1,242,805,359
# ... with 490 more rows
One possible improvement to the function would be to add some idle time with Sys.sleep(3) (a 3-second pause), in case the server kicks you out for hitting the site too many times too quickly; see the sketch below.
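As a minimal sketch of that change (reusing the loop body of scrape_data() above), the pause could go right after the progress message, so each iteration waits before requesting the next page:

# Inside the repeat loop of scrape_data():
cli_alert_success(glue("Scraped page: {i}"))
Sys.sleep(3) # wait 3 seconds before requesting the next page
i <- i + 1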