Question

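I am scraping web pages with rvest and using purrr::map_df to turn the collected data into a data frame. My problem is that not every page has content for each html_nodes selector I specify, and map_df silently drops those incomplete pages. I would like map_df to keep those pages and write NA wherever an html_nodes selector does not match any content. Here is my code: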

library(rvest)
library(tidyverse)

urls <- list("https://en.wikipedia.org/wiki/FC_Barcelona",
             "https://en.wikipedia.org/wiki/Rome", 
             "https://es.wikipedia.org/wiki/Curic%C3%B3")
h <- urls %>% map(read_html)

out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()
  df <- tibble(a, b)
})
out

Here is the output:

> out
# A tibble: 2 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History

The problem is that the output data frame has no row for a site where the #History html node does not match (here, the third URL). The output I want looks like this:

> out
# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA

Any help would be greatly appreciated!

Answer 1

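You only need to add a check inside the map_df call. When a selector matches nothing, html_nodes returns an empty node set and html_text then yields character(0), so check the lengths of a and b: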

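out <- h %>% map_df(~{
  a <- html_nodes(., "#firstHeading") %>% html_text()
  b <- html_nodes(., "#History") %>% html_text()

  # An unmatched selector gives character(0); swap it for NA so the row is kept.
  # With a length-1 test, ifelse() returns only the first matched value.
  a <- ifelse(length(a) == 0, NA, a)
  b <- ifelse(length(b) == 0, NA, b)

  df <- tibble(a, b)
})
out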

# A tibble: 3 x 2
  a            b      
  <chr>        <chr>  
1 FC Barcelona History
2 Rome         History
3 Curicó       NA
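
As a side note, here is a minimal offline sketch of why the row disappeared in the first place, plus one possible alternative to the length check. The two inline HTML pages are hypothetical stand-ins for the Wikipedia pages (so nothing is downloaded), and the names pages, empty and out2 are made up for illustration. It relies on html_node (singular; called html_element in newer rvest versions), which returns a missing node when nothing matches, so html_text gives NA for it.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical stand-ins for the real pages; the second one has no #History node.
pages <- list(
  "<html><body><h1 id='firstHeading'>Rome</h1><h2 id='History'>History</h2></body></html>",
  "<html><body><h1 id='firstHeading'>Curico</h1></body></html>"
) %>% map(read_html)

# Why the row vanished: a length-1 column next to a length-0 column
# is recycled down to zero rows, so that page contributes nothing.
empty <- tibble(a = "Curico", b = character(0))
nrow(empty)

# Alternative: html_node() returns a missing node when the selector
# does not match, and html_text() turns that into NA.
out2 <- pages %>% map_df(~ tibble(
  a = html_text(html_node(.x, "#firstHeading")),
  b = html_text(html_node(.x, "#History"))
))
out2
```

This only fits when each selector is expected to match at most one node per page, as with the id selectors in the question; for selectors that can match several nodes, the length check above is the safer route.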

R: Building a data frame with rvest and purrr::map_df: how to handle incomplete input
