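I can't work out how to build a loop around read_html that extracts the information I need from several sites. I have managed to write a loop that extracts it from a single website.
For example, here is my code for extracting the title, description, and keywords from the Amazon website.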
library(rvest)
library(xml2)   # provides xml_contents()
library(dplyr)

# Parse the page and select its <head> node
URL <- read_html("http://www.amazon.com")
results <- URL %>% html_nodes("head")

# Build one record per <head> node found
records <- vector("list", length = length(results))
for (i in seq_along(records)) {
  title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
  description <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
  keywords <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  records[[i]] <- data.frame(title = title, description = description, keywords = keywords)
}
But what if I have several sites, like this:
name <- c("amazon", "apple", "usps")
# URLs listed in the same order as the names above
url <- c("http://www.amazon.com",
         "http://www.apple.com",
         "http://www.usps.com")
webpages <- data.frame(name, url)
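How can I include read_html in my existing loop so that it extracts the required information for every site and also includes the URL in the result?
Example of the desired output: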
url                     title    description                     keywords
http://www.apple.com    Apple    Apple's website description     Apple, iPhone, iPad
http://www.amazon.com   Amazon   Amazon's website description    Shopping, Home, Online
http://www.usps.com     USPS     USPS's website description      Shipping, Postage, Stamps
Thanks for any suggestions.
Answer 0 (score: 2)
Something like this might work for you.
library(rvest)
library(xml2)   # provides xml_contents()
library(dplyr)

webpages <- data.frame(name = c("amazon", "apple", "usps"),
                       url = c("http://www.amazon.com",
                               "http://www.apple.com",
                               "http://www.usps.com"),
                       stringsAsFactors = FALSE)

# apply() passes each row to the function as a named character vector
webpages <- apply(webpages, 1, function(x) {
  URL <- read_html(x[["url"]])
  results <- URL %>% html_nodes("head")
  # Start with empty vectors in case the page has no <head> node
  title <- desc <- kw <- character(0)
  for (i in seq_along(results)) {
    title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
    desc <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
    kw <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  }
  # Use NA for any field the page does not provide
  data.frame(name = x[["name"]],
             url = x[["url"]],
             title = ifelse(length(title) > 0, title, NA),
             description = ifelse(length(desc) > 0, desc, NA),
             keywords = ifelse(length(kw) > 0, kw, NA))
})

# Stack the per-site rows into a single data frame
webpages <- do.call(rbind, webpages)
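
A variation on the same idea, offered only as a rough sketch: the per-page extraction can be moved into a small helper so that it can be tried on an inline HTML string before touching the network. The names extract_meta and scrape_sites are hypothetical, not part of rvest. The sketch relies on html_node() (named html_element() in rvest 1.0 and later) returning a missing node when nothing matches, so absent tags come back as NA without any length checks. It also assumes that a site which fails to load should produce a row of NA values instead of stopping the whole run.

```r
library(rvest)  # html_node(), html_text(), html_attr(); re-exports read_html()

# Hypothetical helper: title, description, and keywords of one parsed page.
# html_node() yields a missing node when the selector matches nothing,
# and html_text()/html_attr() turn that into NA.
extract_meta <- function(page) {
  data.frame(
    title       = html_text(html_node(page, "title"), trim = TRUE),
    description = html_attr(html_node(page, "meta[name=description]"), "content"),
    keywords    = html_attr(html_node(page, "meta[name=keywords]"), "content"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper: one output row per row of `webpages`
# (a data frame with `name` and `url` columns, as in the question).
scrape_sites <- function(webpages) {
  rows <- lapply(seq_len(nrow(webpages)), function(i) {
    # Assumption: a page that cannot be read yields NA fields, not an error
    page <- tryCatch(read_html(as.character(webpages$url[i])),
                     error = function(e) NULL)
    meta <- if (is.null(page)) {
      data.frame(title = NA_character_, description = NA_character_,
                 keywords = NA_character_, stringsAsFactors = FALSE)
    } else {
      extract_meta(page)
    }
    cbind(webpages[i, c("name", "url"), drop = FALSE], meta)
  })
  do.call(rbind, rows)
}

# Offline check on a literal HTML string (no keywords tag present)
demo <- read_html('<html><head><title> Example Shop </title>
  <meta name="description" content="A demo page"></head><body></body></html>')
meta <- extract_meta(demo)

# With network access, the sites from the question could be processed with:
# result <- scrape_sites(webpages)
```

Keeping the download step inside tryCatch() means one unreachable site does not discard the results already collected for the others.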