Question

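I need to scrape data from every forum listed at https://forums.vwvortex.com/forumdisplay.php?5001-VW-Model-Specific-Forums. For example, I want to scrape https://forums.vwvortex.com/forumdisplay.php?1062-CC/page&pp=200. Is there a way to scrape all of the pages without having to specify page numbers to loop over in the URL above?

I tried the code below, but it feels too long-winded: for every forum URL I have to go back and look up the maximum page number to know how many pages to scrape.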

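library(rvest)  # provides read_html(), html_nodes(), html_text(), html_attr() and %>%

x <- NULL
for (i in 1:86) {  # 86 is the last page of this forum, looked up by hand
  k1 <- "https://forums.vwvortex.com/forumdisplay.php?1062-CC/page"
  k2 <- "&pp=200"
  url <- paste(k1, i, k2, sep = "")
  review <- read_html(url)
  # Text of every .author node on the page
  Dates <- cbind(review %>% html_nodes(".author") %>% html_text())
  # Link (href) of every thread title
  threads <- cbind(review %>% html_nodes("h3.threadtitle") %>% html_nodes("a") %>% html_attr("href"))

  #threads <- cbind(review %>%   html_nodes(".content") %>% html_text() )
  A <- cbind(threads, Dates)
  #A <- as.data.frame(A)
  x <- rbind(x, A)
}
# Convert once after the loop instead of on every iteration
x <- as.data.frame(x)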

Is there a way to scrape all the data automatically, regardless of how many pages the URL has?
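One direction that might work, as a rough and unverified sketch: keep requesting page 1, 2, 3, ... and stop as soon as a page yields no thread links, so the page count never has to be known up front. The names `fetch_forum_page` and `scrape_until_empty` and the `max_pages` safety cap are hypothetical, introduced only for illustration. The URL pattern and CSS selectors are copied from the loop above. How the site responds to an out-of-range page number is an assumption, so the sketch also stops if a page repeats one already seen.

```r
# Hypothetical helper: download listing page i and return one row per thread.
# Uses rvest with the same URL pattern and selectors as the loop above.
fetch_forum_page <- function(i) {
  url <- paste0("https://forums.vwvortex.com/forumdisplay.php?1062-CC/page",
                i, "&pp=200")
  page <- rvest::read_html(url)
  titles <- rvest::html_nodes(page, "h3.threadtitle")
  threads <- rvest::html_attr(rvest::html_nodes(titles, "a"), "href")
  dates <- rvest::html_text(rvest::html_nodes(page, ".author"))
  # Assumption: threads and dates line up one-to-one; truncate to the shorter.
  n <- min(length(threads), length(dates))
  data.frame(thread = threads[seq_len(n)], date = dates[seq_len(n)],
             stringsAsFactors = FALSE)
}

# Hypothetical driver: call fetch_page(1), fetch_page(2), ... until a page is
# empty or repeats an earlier page. max_pages is only a guard against an
# endless loop.
scrape_until_empty <- function(fetch_page, max_pages = 1000) {
  pages <- list()
  seen <- character(0)
  for (i in seq_len(max_pages)) {
    rows <- fetch_page(i)
    key <- paste(rows$thread, collapse = "\n")
    # Stop on an empty page, or if the site serves an already-seen page again.
    if (nrow(rows) == 0 || key %in% seen) break
    seen <- c(seen, key)
    pages[[length(pages) + 1]] <- rows
  }
  if (length(pages) == 0) return(data.frame())
  do.call(rbind, pages)
}

# Usage (performs real HTTP requests, so it is left commented out):
# x <- scrape_until_empty(fetch_forum_page)
```

Passing the fetch function in as an argument means the stopping logic can be tried with a fake page source before pointing it at the real site, and the same driver could be reused for each forum by swapping in a different fetch function.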

Any help is appreciated! Thanks in advance.

Is there a way to web-scrape all pages of a URL with parLapply without knowing the number of pages?

0 answers: