Question

I am trying to extract some data from an HTML site. I have 500 nodes, each of which should contain a date, a title, and a summary. By using

library(xml2)
library(magrittr)  # provides the %>% pipe

url <- "https://www.bild.de/suche.bild.html?type=article&query=Migration&resultsPerPage=1000"
html_raw <- read_html(url)
main_node <- xml_find_all(html_raw, "//section[@class='query']/ol") %>%
  xml_children()

xml_find_all(main_node, ".//time") #time
xml_find_all(main_node, ".//span[@class='headline']") #title
xml_find_all(main_node, ".//p[@class='entry-content']") #summary

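I get three vectors with the dates, titles, and summaries, which can then be knitted together. At least in theory. Unfortunately, my code finds 500 dates and 500 titles, but only 499 summaries. The reason is that one of the nodes has no summary element.

This leaves me with a problem: because the lengths differ, I cannot bind the vectors into a data frame. The summaries would no longer line up with the correct dates and titles.

An easy solution would be to loop through the nodes and replace the empty node with a placeholder such as "NA".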

dates <- as.Date(character(0))  # empty Date vector, so c() keeps the Date class
titles <- c()
summaries <- c()

for(i in seq_along(main_node)){
  date_temp <- xml_find_all(main_node[i], ".//time") %>%
    xml_text(trim = TRUE) %>%
    as.Date(format = "%d.%m.%Y")
  title_temp <- xml_find_all(main_node[i], ".//span[@class='headline']") %>%
    xml_text(trim = TRUE)
  summary_temp <- xml_find_all(main_node[i], ".//p[@class='entry-content']") %>%
    xml_text(trim = TRUE)

  if(length(summary_temp) == 0) summary_temp <- "NA"

  dates <- c(dates, date_temp)
  titles <- c(titles, title_temp)
  summaries <- c(summaries, summary_temp)
}

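But this makes my simple three lines of code unnecessarily long. So I guess my question is: is there a more sophisticated approach than a loop?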

Answer 1

You can use the purrr library to help and avoid the explicit loop:

library(purrr)
dates <- main_node %>% map_chr(. %>% xml_find_first(".//time") %>% xml_text())
titles <- main_node %>% map_chr(. %>% xml_find_first(".//span[@class='headline']") %>% xml_text())
summaries <- main_node %>% map_chr(. %>% xml_find_first(".//p[@class='entry-content']") %>% xml_text())

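This takes advantage of the fact, pointed out by @Dave2e, that xml_find_first returns a missing node when the element is not found, and xml_text turns that missing node into NA.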
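Since xml_find_first returns one result per input node, it can also be applied to the whole nodeset at once. The following is only a minimal sketch of that idea: the inline HTML is a made-up stand-in for the search-result page (the real page's markup may differ), and the names html, items, and result are introduced here for illustration.

```r
library(xml2)

# Hypothetical stand-in for the result list; the second entry has no summary.
html <- read_html('
<ol>
  <li><time>01.02.2019</time><span class="headline">A</span><p class="entry-content">First</p></li>
  <li><time>02.02.2019</time><span class="headline">B</span></li>
  <li><time>03.02.2019</time><span class="headline">C</span><p class="entry-content">Third</p></li>
</ol>')

items <- xml_find_all(html, "//ol/li")

# xml_find_first() returns exactly one (possibly missing) node per item,
# so all three columns have the same length and stay aligned.
result <- data.frame(
  date    = as.Date(xml_text(xml_find_first(items, ".//time"), trim = TRUE),
                    format = "%d.%m.%Y"),
  title   = xml_text(xml_find_first(items, ".//span[@class='headline']"), trim = TRUE),
  summary = xml_text(xml_find_first(items, ".//p[@class='entry-content']"), trim = TRUE),
  stringsAsFactors = FALSE
)
```

In this sketch the missing summary shows up as NA in the second row rather than shifting the later summaries up by one.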

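In general, though, building up a vector or list by appending to it on every iteration of a loop is very inefficient in R. It is better to pre-allocate the vector (since its final length is known) and then assign each iteration's value to the appropriate slot (out[i] <- val). There is nothing inherently wrong with loops in R; it is the repeated memory reallocation that slows things down.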
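The pre-allocation pattern can be sketched in base R as follows. The input list raw_values is a hypothetical placeholder for the per-node results, with NULL standing in for a node that has no summary; it is not part of the original code.

```r
# Hypothetical per-item values; NULL stands in for a missing summary.
raw_values <- list("First", NULL, "Third")

# Pre-allocate the result at its known final length, filled with NA.
out <- rep(NA_character_, length(raw_values))

for (i in seq_along(raw_values)) {
  val <- raw_values[[i]]
  # Assign into the existing slot instead of growing the vector with c().
  if (length(val) > 0) out[i] <- val
}
```

Because out starts out filled with NA, entries with no value need no special placeholder handling; they are simply left untouched.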

XML2 package: how to handle empty nodes?
