How do I loop over multiple websites with read_html in R?

Asked: 2019-02-11 19:33:31

Tags: r web-scraping

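I can't work out how to call read_html inside a loop and extract the information I need. I was able to write a loop that extracts from a single website.

For example, here is my code to extract the title, description, and keywords from the Amazon website.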

library(rvest)
URL <- read_html("http://www.amazon.com")
results <- URL %>% html_nodes("head")

library(dplyr)
records <- vector("list", length = length(results))

for (i in seq_along(records)) {
  title <- xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
  description <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
  keywords <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  records[[i]] <- data.frame(title = title, description = description, keywords = keywords)
}

But what if I have several sites, like this?

name <- c("apple", "amazon", "usps")
url <- c("http://www.apple.com",
         "http://www.amazon.com",
         "http://www.usps.com")
webpages <- data.frame(name, url)

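How can I include read_html in my existing loop so that it extracts the required information for each site and also includes the URL?

Example of the desired output: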

url                      title            description               keywords
http://www.apple.com     Apple    Apple's website description     Apple, iPhone, iPad
http://www.amazon.com    Amazon   Amazon's website description    Shopping, Home, Online
http://www.usps.com      USPS     USPS's website description      Shipping, Postage, Stamps

Thanks for any suggestions.

1 Answer:

Answer 0 (score: 2)

Something like this might work for you.

library(rvest)
library(dplyr)

webpages <- data.frame(name = c("amazon", "apple", "usps"),
                        url = c("http://www.amazon.com",
                                "http://www.apple.com",
                                "http://www.usps.com"))


# apply() hands each row to the function as a named character vector
webpages <- apply(webpages, 1, function(x){
  URL <- read_html(x[['url']])

  results <- URL %>% html_nodes("head")

  # Start empty so the checks below still work if a page has no <head>
  title <- desc <- kw <- character(0)

  for (i in seq_along(results)) {
    # xml2:: prefix in case xml2 is not attached alongside rvest
    title <- xml2::xml_contents(results[i] %>% html_nodes("title"))[1] %>% html_text(trim = TRUE)
    desc <- html_nodes(results[i], "meta[name=description]") %>% html_attr("content")
    kw <- html_nodes(results[i], "meta[name=keywords]") %>% html_attr("content")
  }

  # Fall back to NA when a page lacks one of the tags
  return(data.frame(name = x[['name']],
                    url = x[['url']],
                    title = ifelse(length(title) > 0, title, NA),
                    description = ifelse(length(desc) > 0, desc, NA),
                    keywords = ifelse(length(kw) > 0, kw, NA)))
})

webpages <- do.call(rbind, webpages)
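
One thing the code above does not handle is a site that fails to load, in which case read_html stops the whole run. Here is a rough sketch of one way around that, not a tested solution for these particular sites. The helper names extract_meta and scrape_one are made up for illustration, and the inline HTML is invented sample content so the extraction step can be checked without a network connection. It relies on html_node (singular), which gives NA for a missing tag, so the length checks are not needed.

```r
library(rvest)

# Hypothetical helper: pull title/description/keywords from one parsed page.
# html_node() returns a "missing" node when nothing matches, and
# html_text()/html_attr() turn that into NA.
extract_meta <- function(page) {
  data.frame(
    title       = html_text(html_node(page, "title"), trim = TRUE),
    description = html_attr(html_node(page, "meta[name=description]"), "content"),
    keywords    = html_attr(html_node(page, "meta[name=keywords]"), "content"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper: read one URL and return an all-NA row if reading fails.
scrape_one <- function(name, url) {
  meta <- tryCatch(
    extract_meta(read_html(url)),
    error = function(e) {
      data.frame(title = NA_character_,
                 description = NA_character_,
                 keywords = NA_character_,
                 stringsAsFactors = FALSE)
    }
  )
  cbind(data.frame(name = name, url = url, stringsAsFactors = FALSE), meta)
}

# Offline check on made-up HTML (read_html also accepts literal markup).
demo <- read_html(paste0(
  '<html><head><title> Demo </title>',
  '<meta name="description" content="A demo page">',
  '</head><body></body></html>'
))
demo_row <- extract_meta(demo)

# Usage with the question's sites (needs network access, so left commented):
# sites <- data.frame(name = c("amazon", "apple", "usps"),
#                     url = c("http://www.amazon.com",
#                             "http://www.apple.com",
#                             "http://www.usps.com"),
#                     stringsAsFactors = FALSE)
# result <- do.call(rbind, Map(scrape_one, sites$name, sites$url))
```

Keeping the parsing step separate from the download step means you can test the selectors on a saved or inline page first, then point the wrapper at the live URLs.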