Scraping multiple pages from IMDB with rvest

Date: 2018-12-11 18:13:12

Tags: r

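I am trying to scrape data from this IMDB link: https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv

I want to scrape the runtime and title data with the code below. How can I do the same for several other pages? I tried writing a for loop, but I do not know how to incorporate it into my code. The URLs follow this pattern: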

https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv
https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=201&ref_=adv_nxt
https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=301&ref_=adv_nxt
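Since the three URLs above differ mainly in the `start` offset (101, 201, 301), they could be generated rather than typed out. The following is a minimal sketch, not a tested recipe for the live site: the names `base_url` and `starts` are introduced here for illustration, and dropping the `ref_` parameter is an assumption that it does not change which results are returned.

```r
# Part of the URL shared by every page; only the `start` offset changes.
# Omitting the ref_ parameter is an assumption, not something the thread confirms.
base_url <- "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100"

# One offset per page of 100 results: 101, 201, 301.
starts <- seq(from = 101, to = 301, by = 100)

# Build one URL per offset; %d with integers avoids scientific notation.
urls <- sprintf("%s&start=%d", base_url, as.integer(starts))
```

Raising the `to` argument of `seq()` would extend the vector to further pages.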

My code:

library(rvest)

url <- 'https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv'
webpage <- read_html(url)

titlehtml <- html_nodes(webpage,'.lister-item-header a')
title <- html_text(titlehtml)


runtimehtml <- html_nodes(webpage,'.text-muted .runtime')
runtime <- html_text(runtimehtml)
# Remove the " min" suffix, then convert to numeric
runtime <- gsub(" min", "", runtime)
runtime <- as.numeric(runtime)

1 Answer:

Answer 0: (score: 0)

Try this:

library(rvest)

urls <- c("https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv",
          "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=201&ref_=adv_nxt",
          "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=301&ref_=adv_nxt")

results_list <- list()

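for (.page in seq_along(urls)) {
  # Download and parse one results page
  webpage <- read_html(urls[[.page]])
  titlehtml <- html_nodes(webpage, '.lister-item-header a')
  title <- html_text(titlehtml)
  runtimehtml <- html_nodes(webpage, '.text-muted .runtime')
  runtime <- html_text(runtimehtml)
  runtime <- gsub(" min", "", runtime)
  # data.frame() needs title and runtime to have the same length,
  # so this assumes every listed title has a runtime
  results_list[[.page]] <- data.frame(title = title,
                                      runtime = as.numeric(runtime),
                                      stringsAsFactors = FALSE)
}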

final_results <- plyr::ldply(results_list)
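
One thing to watch in the loop above: if any title on a page has no runtime, the `title` and `runtime` vectors end up with different lengths and `data.frame()` stops with an error. A possible workaround, sketched here under assumptions, is to select each result item first and then take one title and one runtime per item with `html_node()`, which yields `NA` where the element is missing. The `.lister-item-content` selector and the inline HTML are hypothetical stand-ins for the real page markup, which is not shown in this thread.

```r
library(rvest)

# Hypothetical inline HTML standing in for a results page;
# the second item deliberately has no runtime.
html <- '<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">Film A</a></h3>
  <p class="text-muted"><span class="runtime">120 min</span></p>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header"><a href="#">Film B</a></h3>
  <p class="text-muted"></p>
</div>'
page <- read_html(html)

# One node per result item (selector is an assumption about the markup).
items <- html_nodes(page, ".lister-item-content")

# html_node() returns exactly one result per item, missing where absent,
# so both vectors keep the same length as `items`.
title <- html_text(html_node(items, ".lister-item-header a"))
runtime_txt <- html_text(html_node(items, ".text-muted .runtime"))

# Strip the " min" suffix; missing runtimes stay NA after conversion.
runtime <- as.numeric(gsub(" min", "", runtime_txt, fixed = TRUE))

result <- data.frame(title, runtime, stringsAsFactors = FALSE)
```

Inside the loop, the same per-item extraction could replace the two `html_nodes()` calls, with `page` taken from `read_html(urls[[.page]])` instead of the inline string.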