Scraping information from multiple pages of search results

Date: 2021-03-12 02:00:52

Tags: r web-scraping rvest

TL;DR: I can get the URL for each profile, but I don't know how to scrape the information from each profile and put it into a table.

I am new to web scraping, and I am trying to scrape profile information from this website.

Scraping does not violate their terms of use, but the site has no API either. I was able to extract each profile's relative URL from the pages of search results and then paste it onto the domain name. However, I can only do this for one page of results, and I cannot scrape information from the actual profiles through those URLs.

My code is below:

# Load the scraping library
library(rvest)

# Scrape the profile URLs from all 73 pages of search results
profile_url_lst <- list()
for (page_num in 1:73) {
  main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
  html_content <- read_html(main_url)
  # Walk down the document tree to the anchor elements and pull their hrefs
  profile_urls <- html_content %>% html_nodes("body") %>% html_children() %>% html_children() %>% .[2] %>% html_children() %>%
    html_children() %>% .[3] %>% html_children() %>% .[4] %>% html_children() %>% html_children() %>% html_children() %>%
    html_attr("href")

  profile_url_lst[[page_num]] <- profile_urls
  Sys.sleep(2)  # pause between requests
}
# Bind into a list and combine with the domain name
profiles <- cbind(profile_urls)
complete_urls <- paste0('https://www.theeroticreview.com', profile_urls)
complete <- cbind(complete_urls)
complete
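For reference, a minimal sketch of the combining step I am aiming for, assuming the loop above fills profile_url_lst with one character vector per page; all_profile_urls and all_complete_urls are illustrative names, not variables from my code:

# Sketch: flatten the per-page list into one vector of relative URLs,
# so every page's results are kept rather than just the last page's
all_profile_urls <- unlist(profile_url_lst)
all_complete_urls <- paste0("https://www.theeroticreview.com", all_profile_urls)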

# Scrape information from each profile
TED_lst <- list()
base_url <- "https://www.theeroticreview.com"
completed <- c(profile_urls)
for (i in completed) {
  urls <- paste(base_url, i, sep = "")
  pages <- read_html(urls)

  # Pull the profile fields matched by this set of CSS selectors
  TED <- pages %>% html_nodes(".hidden-sm , .td-date .td-link , #collapse3 .col-sm-6+ .col-sm-6 .row:nth-child(5) .col-xs-6 , #collapse3 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(1) .col-xs-6 , .col-sm-6+ .col-sm-6 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(2) .col-xs-6 , #collapse1 .row:nth-child(8) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(5) .col-xs-6 , #collapse1 .col-sm-6+ .col-sm-6 .row:nth-child(1) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(2) .col-xs-6 , .float-heading-left p , h1") %>% html_text()
  TED_lst <- TEDs
}

When I run this code, I can only generate complete URLs for a single page of results. The profile-scraping code does work when I feed it one of those URLs and remove the loop, but trying to run the loop above returns NULL and the URL of a single profile. How do I get it to scrape all the information from multiple profiles and then bind it into one table for use in regression analysis?

1 answer:

Answer 0 (score: 0)

If I understand correctly, you want to scrape every URL listed in complete_urls, right?

In that case, your final for loop should be:

for (i in seq_along(complete_urls)) {  # changed: loop over indices, not URL strings
  pages <- read_html(complete_urls[[i]])  # changed: read each complete URL directly

  TED <- pages %>% html_nodes(".hidden-sm , .td-date .td-link , #collapse3 .col-sm-6+ .col-sm-6 .row:nth-child(5) .col-xs-6 , #collapse3 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(1) .col-xs-6 , .col-sm-6+ .col-sm-6 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(2) .col-xs-6 , #collapse1 .row:nth-child(8) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(5) .col-xs-6 , #collapse1 .col-sm-6+ .col-sm-6 .row:nth-child(1) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(2) .col-xs-6 , .float-heading-left p , h1") %>% html_text()
  TED_lst[[i]] <- TED  # changed: store each profile's result instead of overwriting the list
  Sys.sleep(2)  # pause between requests
}
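From there, to bind everything into one table for regression analysis, here is a minimal sketch, assuming every profile returns the same fields in the same order (TED_df is an illustrative name):

# Sketch: one row per profile; assumes every element of TED_lst is a
# character vector of the same length with fields in the same order
TED_df <- as.data.frame(do.call(rbind, TED_lst), stringsAsFactors = FALSE)

If some profiles are missing fields, the vectors will differ in length and rbind will recycle values silently, so it is worth checking lengths(TED_lst) before binding.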