Scraping all values from a local HTML file in R

Asked: 2014-01-21 23:18:11

Tags: r web-scraping scrape

I have a local HTML file that looks like this:

[screenshot of the HTML file]

Would someone be kind enough to tell me how to approach scraping a few rows from a layout like this?

Here is one of my many unsuccessful attempts:

library(XML)
example.html <- scan(file=file.choose(),what="character")
parse.html <- htmlTreeParse(example.html, useInternalNodes = TRUE)
xpath.val <- xpathApply(parse.html, '//div', xmlValue)
g.val <- gsub('\\s', '', xpath.val)

If anyone is interested in seeing the HTML file itself, it is here.

Edit: Of course I don't expect anyone to solve this for me. I would be happy just to be shown where to look.

1 answer:

Answer 0: (score: 1)

OK, this doesn't get you all the way there, but maybe it will help:

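library(XML)
library(stringr)
# Namespace map for inline XBRL (the XPath below does not use a prefix)
namespaces=c(xmlns="http://www.xbrl.org/2008/inlineXBRL")
# Parse the file straight from its path instead of reading it with scan() first
parse.html <- htmlTreeParse("~/Downloads/html1.html", useInternalNodes=TRUE)
# Select every table row with class "iris_table_row"
tt <- xpathApply(parse.html, '//tr[@class="iris_table_row"]', namespaces=namespaces)
# For one row: take the text of its children, keep the non-empty <td> cells, trim whitespace
foo <- function(x){
  vals <- sapply(xmlChildren(x), xmlValue)
  str_trim(vals[names(vals) %in% "td" & sapply(vals, nchar)>0], "both")
}
# Apply to every row and look at a few of the results
rows <- lapply(tt, foo)
rows[170:175]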

[[1]]
 td 
"%" 

[[2]]
                td                 td 
"Class of shares:"          "holding" 

[[3]]
        td         td 
"Ordinary"   "100.00" 

[[4]]
            td             td 
      "Page 5" "continued..." 

[[5]]
                                                      td 
"Whitton Park Estates Limited (Registered number: 00231549)" 

[[6]]
                                         td 
"Notes to the Abbreviated Accounts - continued"