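I am trying to scrape data from this IMDB link: https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv
I want to scrape the runtime and title data with the code below. How can I do the same for several other pages? I tried writing a for loop, but I don't know how to incorporate it into my code. The URLs follow this pattern: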
https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv
https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=201&ref_=adv_nxt
https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=301&ref_=adv_nxt
My code:
library(rvest)  # provides read_html(), html_nodes(), html_text()

url <- 'https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv'
webpage <- read_html(url)
titlehtml <- html_nodes(webpage, '.lister-item-header a')
title <- html_text(titlehtml)
runtimehtml <- html_nodes(webpage, '.text-muted .runtime')
runtime <- html_text(runtimehtml)
runtime <- gsub(" min", "", runtime)  # strip the " min" suffix
runtime <- as.numeric(runtime)        # convert to numeric
Answer 0 (score: 0)
Try this:
library(rvest)  # provides read_html(), html_nodes(), html_text()

urls <- c("https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv",
          "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=201&ref_=adv_nxt",
          "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=301&ref_=adv_nxt")

results_list <- list()
for (.page in seq_along(urls)) {
  # Scrape one page, using the same selectors as in the question
  webpage <- read_html(urls[[.page]])
  titlehtml <- html_nodes(webpage, '.lister-item-header a')
  title <- html_text(titlehtml)
  runtimehtml <- html_nodes(webpage, '.text-muted .runtime')
  runtime <- html_text(runtimehtml)
  runtime <- gsub(" min", "", runtime)
  # Assumes every title on the page has a runtime; if one is missing,
  # the two vectors differ in length and data.frame() will error
  results_list[[.page]] <- data.frame(title = title,
                                      runtime = as.numeric(runtime))
}
# Stack the per-page data frames into one
final_results <- plyr::ldply(results_list)
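If there are many pages, typing each URL by hand gets tedious. Since only the start value changes between pages (101, 201, 301, ...), the vector could be generated instead. The following is a minimal sketch, not a tested scraper: build_urls is a hypothetical helper name, and it assumes the ref_ parameter does not affect which results are returned, so the same value is used on every page.

```r
# URL template taken from the question; %d is filled in with the `start` value.
# Assumption: `ref_` does not change the results, so it is fixed to "adv_nxt".
base_url <- "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=%d&ref_=adv_nxt"

# Hypothetical helper: build the URLs for `n_pages` consecutive result pages.
build_urls <- function(n_pages, first_start = 101, page_size = 100) {
  # Start offsets: first_start, first_start + page_size, ...
  starts <- seq(from = first_start, by = page_size, length.out = n_pages)
  sprintf(base_url, as.integer(starts))
}

urls <- build_urls(3)
```

The resulting urls vector can be passed to the loop above in place of the hand-written one; changing the argument to build_urls covers more pages.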