Question

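I am trying to scrape an HTML table from a web page. However, the page contains many HTML tables that I do not want to scrape. To identify the table I do want, I would like to take the first table that follows a specific combination of words (the words are not inside the table, but are part of the surrounding text). Here is an example: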

This is the table I am interested in:

library(XML)
url <- "http://www.sec.gov/Archives/edgar/data/1301063/000119312514133663/0001193125-14-133663.txt"
# Parse every table on the page; the one I want happens to be number 29
readHTMLTable(url, trim = TRUE, header = FALSE, stringsAsFactors = FALSE)[29]

The criterion I want to use to detect the table is that it is the first table following this combination of words:

"safety, health, environmental and sustainability challenges"

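library(RCurl)  # provides getURL()
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText = TRUE)
# Collect all text nodes, skipping script, style, noscript and form content
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
# Confirm that the phrase occurs somewhere in the page text
grep("safety, health, environmental and sustainability challenges", text, value = TRUE)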
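
One direction that might work is to let XPath do the positional part: find the first text node containing the phrase, then step along the following:: axis to the first table after it. The sketch below is only an illustration under stated assumptions. The inline HTML string is a hypothetical stand-in for the filing (so the example runs without network access), the phrase is assumed to sit inside a single text node, and case is folded with translate() because XPath 1.0 has no lower-case function.

```r
library(XML)

# Hypothetical miniature page standing in for the real filing.
html <- paste0(
  "<html><body>",
  "<table><tr><td>unrelated</td><td>1</td></tr></table>",
  "<p>Safety, health, environmental and sustainability challenges</p>",
  "<table><tr><td>target</td><td>2</td></tr></table>",
  "<table><tr><td>later</td><td>3</td></tr></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

phrase <- "safety, health, environmental and sustainability challenges"

# First text node containing the phrase (case-insensitive, whitespace
# normalized), then the first <table> that follows it in document order.
xpath <- sprintf(
  paste0(
    "(//text()[contains(translate(normalize-space(.), ",
    "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')])[1]",
    "/following::table[1]"
  ),
  phrase
)
nodes <- getNodeSet(doc, xpath)

# readHTMLTable() also accepts a single table node, not just a URL or document.
target <- if (length(nodes) > 0) {
  readHTMLTable(nodes[[1]], header = FALSE, trim = TRUE, stringsAsFactors = FALSE)
} else {
  NULL
}
print(target)
```

With the real page, doc would come from the htmlParse(html, asText = TRUE) call above instead of the inline string. If the phrase is split across several text nodes (for example by inline formatting tags), this match would fail and the predicate would need to test an enclosing element instead.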

Extract the HTML table immediately following specified text

0 answers: