TL;DR: I can get the URL for each profile, but I don't know how to scrape the information from each profile and put it into a table.
I am new to web scraping, and I am trying to scrape information from the profiles on this website.
This does not violate their terms of use, but the site has no API either. I was able to extract the URL of each profile from all pages of the search results and then paste them onto the domain name. However, I can only do this for one page of results, and I cannot scrape information from the actual profiles via those URLs.
My code is as follows:
# Scrape the profile URLs
library(rvest)

profile_url_lst <- list()
for (page_num in 1:73) {
  main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
  html_content <- read_html(main_url)
  profile_urls <- html_content %>%
    html_nodes("body") %>% html_children() %>% html_children() %>% .[2] %>%
    html_children() %>% html_children() %>% .[3] %>%
    html_children() %>% .[4] %>%
    html_children() %>% html_children() %>% html_children() %>%
    html_attr("href")
  profile_url_lst[[page_num]] <- profile_urls
  Sys.sleep(2)
}
# Bind into a list and combine with the domain name
profiles <- cbind(profile_urls)
complete_urls <- paste0("https://www.theeroticreview.com", profile_urls)
complete <- cbind(complete_urls)
complete
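Note that `profile_urls` only holds the last page scraped, which is why only one page of complete URLs appears. A minimal sketch of flattening the whole `profile_url_lst` instead (shown here with small illustrative path values, since the real list comes from the scraping loop above):

```r
# Suppose profile_url_lst holds one character vector of hrefs per results page
profile_url_lst <- list(
  c("/member/p1", "/member/p2"),  # page 1 (illustrative paths)
  c("/member/p3")                 # page 2 (illustrative path)
)

# Flatten all pages into one vector, then prepend the domain
all_profile_urls <- unlist(profile_url_lst)
complete_urls <- paste0("https://www.theeroticreview.com", all_profile_urls)
```

With the real 73-page list, `complete_urls` would then contain every profile URL rather than only those from the final page.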
# Scrape information from each profile
TED_lst <- list()
base_url <- "https://www.theeroticreview.com"
completed <- c(profile_urls)
for (i in completed) {
  urls <- paste(base_url, i, sep = "")
  pages <- read_html(urls)
  TED <- pages %>% html_nodes(".hidden-sm , .td-date .td-link , #collapse3 .col-sm-6+ .col-sm-6 .row:nth-child(5) .col-xs-6 , #collapse3 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(1) .col-xs-6 , .col-sm-6+ .col-sm-6 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(2) .col-xs-6 , #collapse1 .row:nth-child(8) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(5) .col-xs-6 , #collapse1 .col-sm-6+ .col-sm-6 .row:nth-child(1) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(2) .col-xs-6 , .float-heading-left p , h1") %>% html_text()
  TED_lst <- TEDs
}
When I run this code, I can only generate complete URLs for a single page of results. The profile-scraping code does work if I use one of those URLs directly and remove the loop, but running the loop above returns NULL and the URL of a single profile. How can I get it to scrape all the information from every profile and then bind it into one table for regression analysis?
Answer 0 (score: 0)
If I understand correctly, you want to scrape each of the URLs listed in `complete_urls`, right?
In that case, your final `for` loop should be:
for (i in seq_along(complete_urls)) {  # loop over indices, not over the URL strings
  pages <- read_html(complete_urls[[i]])  # I changed this one
  TED <- pages %>% html_nodes(".hidden-sm , .td-date .td-link , #collapse3 .col-sm-6+ .col-sm-6 .row:nth-child(5) .col-xs-6 , #collapse3 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(1) .col-xs-6 , .col-sm-6+ .col-sm-6 .row:nth-child(3) .col-xs-6 , .col-sm-6:nth-child(1) .row:nth-child(2) .col-xs-6 , #collapse1 .row:nth-child(8) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(5) .col-xs-6 , #collapse1 .col-sm-6+ .col-sm-6 .row:nth-child(1) .col-xs-6 , #collapse1 .col-sm-6:nth-child(2) .row:nth-child(2) .col-xs-6 , .float-heading-left p , h1") %>% html_text()
  TED_lst[[i]] <- TED  # store each profile's results instead of overwriting the list
  Sys.sleep(2)
}
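To then turn the scraped results into a table for regression, one option is to row-bind the per-profile vectors into a data frame. This is a sketch only: it assumes every profile yields the same number of fields in the same order (illustrative values below stand in for the real scraped text), which you would need to verify for this site.

```r
# Illustrative stand-in for the real scraped output: one character
# vector of fields per profile, all the same length and order
TED_lst <- list(
  c("Profile A", "Field1 A", "Field2 A"),
  c("Profile B", "Field1 B", "Field2 B")
)

# Turn each vector into a one-row data frame, then stack the rows
TED_df <- do.call(
  rbind,
  lapply(TED_lst, function(x) as.data.frame(t(x), stringsAsFactors = FALSE))
)
```

If profiles differ in which fields are present, the vectors will have different lengths and `rbind` will fail or recycle values, so it may be safer to scrape each field with its own selector into named columns.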