Question

I have a list of web page URLs. The pages all share the same layout and differ only in the information they show.

For example:

http://www.halfordsautocentres.com/autocentres/chesterfield
http://www.halfordsautocentres.com/autocentres/derby-london-road
http://www.halfordsautocentres.com/autocentres/derby-wyvern-way

Each page has a different address under the CSS selector .store-details__address.

I wrote the following code, which returns the correct address for a single page:

library(rvest)

derby <- read_html("http://www.halfordsautocentres.com/autocentres/derby-wyvern-way")
derby %>%
  html_node(".store-details__address") %>%
  html_text()
#> [1] "Unit 7, Wyvern Way, Wyvern Retail Park, Derby, DE21 6NZ"

How can I get read_html to read a list of URLs rather than just one?

Thanks.

Answer 1

You can use whichever looping strategy you prefer: for, lapply, or purrr::map.

require(rvest)
urls <- c("http://www.halfordsautocentres.com/autocentres/chesterfield",
          "http://www.halfordsautocentres.com/autocentres/derby-london-road",
          "http://www.halfordsautocentres.com/autocentres/derby-wyvern-way")

Base R with a for loop

# Pre-allocate one slot per URL
out <- vector("character", length = length(urls))
for (i in seq_along(urls)) {
  # Parse the page, then extract the address text
  derby <- read_html(urls[i])
  out[i] <- derby %>%
    html_node(".store-details__address") %>%
    html_text()
}

Base R with *apply

urls %>%
  lapply(read_html) %>%
  lapply(html_node, ".store-details__address") %>%
  vapply(html_text, character(1))
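As a variation on the chained lapply calls, the three steps can be wrapped in one function and applied in a single vapply pass. This is only a sketch: get_address is a hypothetical helper name, and the literal HTML strings are stand-ins for the real pages so that the example runs offline (read_html accepts a URL, a file path, or literal HTML).

```r
library(rvest)

# Hypothetical helper: parse one page and return its address text.
get_address <- function(page) {
  page %>%
    read_html() %>%
    html_node(".store-details__address") %>%
    html_text()
}

# Stand-in pages (assumption): literal HTML instead of live URLs.
pages <- c(
  first  = "<html><body><p class='store-details__address'>1 Example Street</p></body></html>",
  second = "<html><body><p class='store-details__address'>2 Example Road</p></body></html>"
)

# vapply enforces exactly one string per page and keeps the names of `pages`.
addresses <- vapply(pages, get_address, character(1))
print(addresses)
```

With the real data you would call vapply(urls, get_address, character(1)). Because urls is an unnamed character vector, vapply would then use the URLs themselves as the names of the result.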

tidyverse/purrr
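A minimal sketch of the purrr variant, assuming it mirrors the *apply pipeline with map() in place of lapply() and map_chr() in place of vapply(). The literal HTML strings are stand-ins (an assumption) so that the example runs offline. With the real pages you would start the pipeline from urls instead.

```r
library(rvest)
library(purrr)

# Stand-in pages (assumption): literal HTML instead of live URLs.
pages <- c(
  "<html><body><p class='store-details__address'>1 Example Street</p></body></html>",
  "<html><body><p class='store-details__address'>2 Example Road</p></body></html>"
)

addresses <- pages %>%
  map(read_html) %>%                            # parse each page
  map(html_node, ".store-details__address") %>% # select the address node
  map_chr(html_text)                            # extract one string per page

print(addresses)
```

map_chr() plays the same role as vapply(..., character(1)): it fails if any page does not yield exactly one string.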
