Question

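So I'm trying to scrape a web page that has irregular blocks of data, organized in a way that is easy for a human to spot. Let's imagine we are looking at Wikipedia. If I scrape the text from the article at the link below, I end up with 33 entries. If I grab just the headings, I end up with only 7 (see the code below). This result doesn't surprise us, since we know that some sections of the article have multiple paragraphs while others have only a single paragraph of text.

My question is how to associate my headings with my text. If there were the same number of paragraphs per heading, or some multiple of it, this would be trivial.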

library(rvest)
# read_html() replaces html(), which is deprecated in later rvest releases
wiki <- read_html("https://en.wikipedia.org/wiki/Web_scraping")

wikitext <- wiki %>% 
  html_nodes('p+ ul li , p') %>%
  html_text(trim=TRUE)

wikiheading <- wiki %>% 
  html_nodes('.mw-headline') %>%
  html_text(trim=TRUE)

Answer 1

This gives you a list called content whose elements are named after the headings and contain the corresponding text.

library(rvest) # Assumes version 0.2.0.9 or later (not on CRAN at the time of writing)
library(xml2)  # Provides read_html(), xml_children(), xml_name(), and xml_text()
wiki <- read_html("https://en.wikipedia.org/wiki/Web_scraping")

# This node set contains the headings and text
wikicontent <- wiki %>% 
  html_nodes("div[id='mw-content-text']") %>%
  xml_children()

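# Locate the positions of the h2 headings; the extra final entry marks
# where the last section stops (just short of the end of the node set)
headings <- sapply(wikicontent, xml_name)
headings <- c(grep("h2", headings), length(headings) - 1)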

# Loop through the headings keeping the stuff in-between them as content
content <- list()
for (i in seq_len(length(headings) - 1)) {
  foo <- wikicontent[headings[i]:(headings[i+1]-1)]
  foo.title <- xml_text(foo[[1]])
  foo.content <- xml_text(foo[-c(1)])
  content[[i]] <- foo.content
  names(content)[i] <- foo.title
}

The key is to spot the mw-content-text node, which has everything you want as its children.
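
The same idea can be tried offline on a small hand-written page. This is only a sketch: the HTML string below is a made-up stand-in for the article, not the real Wikipedia markup, and the names page, nodes and sections are hypothetical. It swaps the loop for cumsum(), which numbers each node by the heading it follows.

```r
library(xml2)

# Made-up stand-in for the page: headings and paragraphs are siblings
# inside a single container, like the children of mw-content-text.
page <- read_html('
<div id="mw-content-text">
  <p>Lead paragraph.</p>
  <h2>Techniques</h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <h2>Legal issues</h2>
  <p>Only paragraph.</p>
</div>')

nodes <- xml_children(xml_find_first(page, "//div[@id='mw-content-text']"))

# Every h2 starts a new group; nodes before the first h2 land in group 0
is_heading <- xml_name(nodes) == "h2"
group <- cumsum(is_heading)

# Drop the headings themselves and anything before the first heading,
# then split the remaining text by the heading it follows
keep <- !is_heading & group > 0
sections <- split(xml_text(nodes[keep]), group[keep])
names(sections) <- xml_text(nodes[is_heading])[as.integer(names(sections))]
```

Here sections[["Techniques"]] holds two paragraphs and sections[["Legal issues"]] holds one, so an uneven number of paragraphs per heading is not a problem. A heading with no text under it simply does not appear in the result.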

R: Web scraping irregular blocks of values
