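I am trying to write code that combines RSelenium and rvest to solve a problem: when scraping a long list of websites, rvest on its own always times out.
Since rvest alone does not work, RSelenium can solve the problem by opening and closing each website in the list in a loop, but I am afraid this approach could take a long time if the list of websites is long.
I tried to combine my earlier code and add a new loop over multiple websites using RSelenium, but it does not seem to work.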
library(rvest)
library(xml2)
library(dplyr)
library(readr)
library(RSelenium)
webpages <- data.frame(name = c("amazon", "apple", "usps", "yahoo", "bbc", "ted", "surveymonkey", "forbes", "imdb", "hp"),
url = c("http://www.amazon.com",
"http://www.apple.com",
"http://www.usps.com",
"http://www.yahoo.com",
"http://www.bbc.com",
"http://www.ted.com",
"http://www.surveymonkey.com",
"http://www.forbes.com",
"http://www.imdb.com",
"http://www.hp.com"))
driver <- rsDriver(browser = c("chrome"))
remDr <- driver[["client"]]
i <- 1
while(i <= 4){
url <- webpages$url[i]
remDr$navigate(url)
page_source <- remDr$getPageSource()
URL <- read_html(page_source)
results <- URL %>% html_nodes("head")
records <- vector("list", length = length(results))
for (i in seq_along(records)) {
title <- xml_contents(results[i] %>%
html_nodes("title"))[1] %>% html_text(trim = TRUE)
description <- results[i] %>%
html_nodes("meta[name=description]") %>% html_attr("content")
keywords <- results[i] %>%
html_nodes("meta[name=keywords]") %>% html_attr("content")
}
i <- i + 1
remDr$close()
return(data.frame(name = x['name'],
url = x['url'],
title = ifelse(length(title) > 0, title, NA),
description = ifelse(length(description) > 0, desc, NA),
keywords = ifelse(length(keywords) > 0, kw, NA)))
}
The error I am getting now is:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
The result I want looks like this:
url title description keywords
http://www.apple.com Apple website description keywords
http://www.amazon.com Amazon website description keywords
http://www.usps.com Usps website description keywords
http://www.yahoo.com Yahoo website description keywords
http://www.bbc.com Bbc website description keywords
http://www.ted.com Ted website description keywords
http://www.surveymonkey.com Survey Monkey website description keywords
http://www.forbes.com Forbes website description keywords
http://www.imdb.com Imdb website description keywords
http://www.hp.com Hp website description keywords
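Answer 0 (score: 1)
You just need to change page_source to page_source[[1]], because remDr$getPageSource() returns a list and read_html() expects a character string. Also pay closer attention to variable naming (e.g. indexers, vectors) and to how you call things: the original code reuses i for both loops and refers to variables that were never defined. I also suggest printing some messages when you use loops like this. In addition, if you remove remDr$close(), you avoid dropping the connection. Finally, you can store the results in a variable defined outside the loop: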
scrapped = list()
i <- 1
while(i <= 4){
url <- webpages$url[i]
print( paste("Accessing to:", url) )
remDr$navigate(url)
page_source <- remDr$getPageSource()
URL <- read_html(page_source[[1]])
results <- URL %>% html_nodes("head")
records <- vector("list", length = length(results))
for (ii in seq_along(records)) {
title <- xml_contents(results[ii] %>% html_nodes("title"))[1] %>%
html_text(trim = TRUE)
desc <- results[ii] %>%
html_nodes("meta[name=description]") %>%
html_attr("content")
keywords <- results[ii] %>%
html_nodes("meta[name=keywords]") %>%
html_attr("content")
}
#remDr$close()
scrapped[[i]] = data.frame(name = webpages[i,'name'],
url = webpages[i,'url'],
title = ifelse(length(title) > 0, title, NA),
description = ifelse(length(desc) > 0, desc, NA),
keywords = ifelse(length(keywords) > 0, keywords, NA))
i = i + 1
}
Output
do.call('rbind', scrapped)
# name url title
#1 amazon http://www.amazon.com Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs & more
#2 apple http://www.apple.com Apple
#3 usps http://www.usps.com Welcome | USPS
#4 yahoo http://www.yahoo.com Yahoo
# description
#1 Online shopping from the earth's biggest selection of books, magazines, music, DVDs, videos, electronics, computers, software, apparel & accessories, shoes, jewelry, tools & hardware, housewares, furniture, sporting goods, beauty & personal care, broadband & dsl, gourmet food & just about anything else.
#2 <NA>
#3 Welcome to USPS.com. Find information on our most convenient and affordable shipping and mailing services. Use our quick tools to find locations, calculate prices, look up a ZIP Code, and get Track & Confirm info.
#4 Las noticias, el correo electrónico y las búsquedas son tan solo el comienzo. Descubre algo nuevo todos los días en Yahoo.
# keywords
#1 Amazon, Amazon.com, Books, Online Shopping, Book Store, Magazine, Subscription, Music, CDs, DVDs, Videos, Electronics, Video Games, Computers, Cell Phones, Toys, Games, Apparel, Accessories, Shoes, Jewelry, Watches, Office Products, Sports & Outdoors, Sporting Goods, Baby Products, Health, Personal Care, Beauty, Home, Garden, Bed & Bath, Furniture, Tools, Hardware, Vacuums, Outdoor Living, Automotive Parts, Pet Supplies, Broadband, DSL
#2 <NA>
#3 Quick Tools, Shipping Services, Mailing Services, Village Post Office, Ship Online, Flat Rate, Postal Store, Ship a Package, Send Mail, Manage Your Mail, Business Solutions, Find Locations, Calculate a Price, Look Up a ZIP Code, Track Packages, Print a Label, Stamps
#4 yahoo, yahoo inicio, yahoo página de inicio, yahoo búsqueda, correo yahoo, yahoo messenger, yahoo juegos, noticias, finanzas, deportes, entretenimiento
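To see the page_source[[1]] fix in isolation, here is a minimal sketch that runs without a Selenium session. The helper names extract_meta and first_or_na and the fake_source value are hypothetical and not part of the code above; fake_source only imitates the list that remDr$getPageSource() returns.

```r
library(rvest)  # also makes read_html() and %>% available

# Hypothetical helper (an assumption, not from the original answer):
# pull the title and two meta tags out of one page source.
extract_meta <- function(page_source) {
  # getPageSource() returns a list; read_html() needs the string inside it
  html <- read_html(page_source[[1]])

  # Return the first match, or NA when the tag is missing
  first_or_na <- function(x) if (length(x) > 0) x[[1]] else NA_character_

  data.frame(
    title = first_or_na(
      html %>% html_nodes("title") %>% html_text(trim = TRUE)),
    description = first_or_na(
      html %>% html_nodes("meta[name=description]") %>% html_attr("content")),
    keywords = first_or_na(
      html %>% html_nodes("meta[name=keywords]") %>% html_attr("content")),
    stringsAsFactors = FALSE
  )
}

# Stand-in for remDr$getPageSource(): a list holding one HTML string
fake_source <- list(paste0(
  "<html><head><title> Example </title>",
  "<meta name=\"description\" content=\"A sample page\">",
  "</head><body></body></html>"
))

result <- extract_meta(fake_source)
print(result)
```

Inside the loop above, the same helper could presumably be called as extract_meta(remDr$getPageSource()), which would replace the inner for loop and the three ifelse() checks.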