Question

I get this error when running my code:

Error in data.frame(date = html_text(html_nodes(pagina, ".node-post-date")),  : 
  arguments imply differing number of rows: 9, 10

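When scraping page 983, I get only 9 results for the date tag (instead of the usual 10 per page). I think this happens because, on that page, one of the dates I want to scrape is marked up differently from the one I am selecting.

I am still quite new to R, so I don't know how to add an if statement to my code that returns NA where no result is found.

Here is my code: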

#Libraries
library(rvest)
library(purrr)
library(tidytext)
library(dplyr)

url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'

map_df(980:990, function(i) {

  pagina <- read_html(sprintf(url_espectador, i, '%s', '%s', '%s', '%s'))
  print(i)

  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             date = html_text(html_nodes(pagina, ".node-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors=FALSE)
  }) -> noticias_espectador

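Is there a solution other than an if statement? I will be scraping a large number of pages, so I need to avoid this row-mismatch problem. Thanks for your help!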

Answer 1

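You can use the CSS OR syntax (a comma-separated selector list) to add the other class. This is fine when there are only a few alternative classes.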

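Alternatively, you can select a shared parent node and test whether a particular child node is present, returning NA otherwise. This answer shows you the latter approach. If you go that way, the selector .node--search-result gives you a suitable parent node. You may still miss the child node you are actually interested in (here, because it uses a different class), but the code will not error.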
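A minimal sketch of the parent-node approach, run against a small inline HTML snippet rather than the live site. The markup is hypothetical (only the class names come from the code in this thread), and relying on html_node() to yield NA for a missing child is an illustration of the idea, not necessarily how the referenced answer does it.

```r
library(rvest)

# Hypothetical markup standing in for one results page: the second result
# uses a different class for its date, and the third has no date at all.
html <- '
<div class="node--search-result">
  <div class="node-title"><a href="/a">First</a></div>
  <div class="node-post-date">1 Jan 2019</div>
</div>
<div class="node--search-result">
  <div class="node-title"><a href="/b">Second</a></div>
  <div class="field--name-post-date">2 Jan 2019</div>
</div>
<div class="node--search-result">
  <div class="node-title"><a href="/c">Third</a></div>
</div>'
pagina <- read_html(html)

# One node per search result: the shared parent.
resultados <- html_nodes(pagina, ".node--search-result")

# html_node() (singular) returns exactly one match per parent, or a
# "missing" node when nothing matches; html_text() turns that into NA.
noticias <- data.frame(
  title       = html_text(html_node(resultados, ".node-title a")),
  date_strict = html_text(html_node(resultados, ".node-post-date")),
  date_any    = html_text(html_node(resultados, ".node-post-date, .field--name-post-date")),
  link        = html_attr(html_node(resultados, ".node-title a"), "href"),
  stringsAsFactors = FALSE
)
print(noticias)
```

Every column has one entry per parent node, so data.frame() cannot hit the row-count mismatch. With the single-class selector the second date comes back as NA (the "missed child" case described above); adding the second class to the selector recovers it, and only the result with no date at all stays NA.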

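There is also a third option: in the cases observed, the classes share a common suffix, so you could use a CSS attribute selector with the contains (*) or ends-with ($) operator, e.g. date = html_text(html_nodes(pagina, "[class$='post-date']")).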
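To see how the two attribute operators differ, here is a small sketch on a hypothetical inline snippet. The third element, with an extra trailing class, is an assumption added purely to show the difference and is not taken from the site.

```r
library(rvest)

# Hypothetical markup: two date classes sharing the "post-date" suffix,
# plus one element whose class attribute has something after that suffix.
html <- '
<div class="node-post-date">1 Jan 2019</div>
<div class="field--name-post-date">2 Jan 2019</div>
<div class="field--name-post-date field">3 Jan 2019</div>'
pagina <- read_html(html)

# $= compares against the end of the whole class attribute string.
ends_with <- html_text(html_nodes(pagina, "[class$='post-date']"))

# *= matches the substring anywhere in the class attribute.
contains <- html_text(html_nodes(pagina, "[class*='post-date']"))

print(ends_with)
print(contains)
```

The ends-with form only matches when the suffix is the last thing in the class attribute, so the contains form is the more forgiving choice if elements can carry additional classes. Both still return only the nodes that exist, so on their own they do not pad missing dates with NA.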

library(rvest)
library(purrr)
library(stringr)   # provides str_trim()
library(tidytext)
library(dplyr)

url_espectador <- 'https://www.elespectador.com/search/farc?page=%d&sort=created&order=desc'

map_df(980:990, function(i) {

  # The URL template has a single %d placeholder, so sprintf() only needs the page number
  pagina <- read_html(sprintf(url_espectador, i))
  print(i)

  data.frame(title = html_text(html_nodes(pagina, ".node-title a")),
             # Comma-separated selector list: match either date class
             date = html_text(html_nodes(pagina, ".node-post-date, .field--name-post-date")),
             link = paste0("https://www.elespectador.com", str_trim(html_attr(html_nodes(pagina, ".node-title a"), "href"))),
             stringsAsFactors=FALSE)
}) -> noticias_espectador

Arguments imply differing number of rows when scraping a website
