Question

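I always struggle with the XML packages in R, and I need some help scraping some nicely formatted tables using xml2.

The URL for the first page of tables I'm scraping is here. On some pages I want the second and third tables, but on other pages I want the first and second tables. The common thread is that I want every table whose 'caption' tag contains the text 'that meet' scraped and stored in one list, and every table whose 'caption' tag contains the text 'that do not meet any' stored in a second list. I'm really not sure how to do this. The code I'm working with is below. I imagine there must be a way to make a regexp the condition for selecting the whole table. Hope the code works.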

library(xml2)

# Define the URLs of the 12 batch pages
urls <- lapply(1:12, function(x) paste0('http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-', x, '/index-eng.php'))
# Scrape each page
batches <- lapply(urls, read_html)
# Return the tables from each page
batches_tables <- lapply(batches, function(x) xml_find_all(x, './/table'))
# Get the tables from the first page
out <- batches_tables[[1]]
# Inspect
out[[1]] # do not want this table
out[[2]] # want this table pasted in one list, caption='that meet'
out[[3]] # want this table pasted in a second list, caption='that do not meet'

Answer 1

Target the caption with contains(), then move up to the parent:

library(xml2)
library(rvest)

URL <- "http://www.chemicalsubstanceschimiques.gc.ca/challenge-defi/batch-lot-1/index-eng.php#s1"
pg <- read_html(URL)

html_nodes(pg, xpath=".//table/caption[contains(., 'that meet')]/..")
## {xml_nodeset (1)}
## [1] <table class="fontSize80">&#13;\n          <caption>&#13;\n          ...

html_nodes(pg, xpath=".//table/caption[contains(., 'that do not meet')]/..")
## {xml_nodeset (1)}
## [1] <table class="fontSize85">&#13;\n          <caption>&#13;\n          ...
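
The same XPath could be wrapped in a helper to fill the two lists the question asks for. What follows is only a minimal sketch: the helper name tables_by_caption, the pages variable, and the inline HTML (a made-up stand-in for one batch page, so the example runs offline) are assumptions for illustration, not part of the original code or the real site.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for one batch page; the real pages will differ.
html <- '<html><body>
<table><caption>Some other table</caption>
  <tr><th>x</th></tr><tr><td>1</td></tr></table>
<table><caption>Substances that meet the criteria</caption>
  <tr><th>Id</th><th>Name</th></tr><tr><td>1</td><td>A</td></tr></table>
<table><caption>Substances that do not meet any of the criteria</caption>
  <tr><th>Id</th><th>Name</th></tr><tr><td>2</td><td>B</td></tr></table>
</body></html>'

# Return every table whose caption contains `text`, as a list of data frames.
# Assumes `text` contains no single quotes, since it is pasted into the XPath.
tables_by_caption <- function(page, text) {
  xp <- sprintf(".//table/caption[contains(., '%s')]/..", text)
  html_table(xml_find_all(page, xp))
}

# With the real site this would be lapply(urls, read_html).
pages <- list(read_html(html))

# One list per caption type, with one element per page.
meet     <- lapply(pages, tables_by_caption, text = "that meet")
not_meet <- lapply(pages, tables_by_caption, text = "that do not meet")
```

Because 'that do not meet' does not contain the substring 'that meet', the two selections do not overlap, and the position of the tables on each page no longer matters.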
