Separating HTML documents

Time: 2014-12-02 18:00:36

Tags: html xml r xpath html-parsing

I am using the XML library in R and would like to split the HTML of a file into chunks:

library(XML)
myHTML <- htmlTreeParse("myHTMLfile.HTML", useInternalNodes = TRUE)
unlist(xpathApply(myHTML, '//div', xmlValue))

This works well and gives me one long character vector. Ideally, however, I would like to split my HTML into chunks. The HTML is structured as follows:

    <DOC>
       <div>
           Document 1 - Element 1
       </div>

       <div>
           Document 1 - Element 2
       </div>

       <div>
           Document 1 - Element 3
       </div>

    </DOC>

    <DOC>
       <div>
           Document 2 - Element 1
       </div>

       <div>
           Document 2 - Element 2
       </div>

       <div>
           Document 2 - Element 3
       </div>

    </DOC>

So I would like a list in which each element corresponds to one DOC, and each list element is a character vector containing elements 1, 2, 3 of that DOC.

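I am having trouble (A) even querying 'DOC', perhaps because it is not part of the namespace, and (B) getting this kind of list of character vectors as output.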

So instead of this output

[1] "Document 1 - Element 1"
[2] "Document 1 - Element 2"
[3] "Document 1 - Element 3"
[4] "Document 2 - Element 1"
[5] "Document 2 - Element 2"
[6] "Document 2 - Element 3"

I would like to get this:

[[1]]
    [1] "Document 1 - Element 1"
    [2] "Document 1 - Element 2"
    [3] "Document 1 - Element 3"
[[2]]
    [1] "Document 2 - Element 1"
    [2] "Document 2 - Element 2"
    [3] "Document 2 - Element 3"

Thank you very much for your help!

Here is a sample of the HTML file I want to process:

https://raw.githubusercontent.com/sytpp/sample-files/master/data_3.html

3 answers:

Answer 0: (score: 2)

How about this.

library(XML)
# The DOC elements are wrapped in a single <DOCS> root so the string is well-formed XML
dd <- xmlInternalTreeParse("<DOCS><DOC>
       <div>Document 1 - Element 1</div>
       <div>Document 1 - Element 2</div>
       <div>Document 1 - Element 3</div>
</DOC><DOC>
       <div>Document 2 - Element 1</div>
       <div>Document 2 - Element 2</div>
       <div>Document 2 - Element 3</div>
</DOC></DOCS>")


xmlApply(dd["//DOC"], function(x) xpathSApply(x,".//div", xmlValue))

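We find all the DOC elements and then look up all the divs within each DOC: the outer xmlApply iterates over the DOC elements, and the inner xpathSApply finds each DOC's div elements and extracts their text with xmlValue.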
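The same outer/inner pattern can also be sketched with plain lapply and getNodeSet. This is only an illustration under some assumptions: the `<DOCS>` wrapper root and the shortened sample text are made up here, and trimws (base R 3.2 or later) is used to strip the whitespace around each div's text, which the question's indented HTML would otherwise keep.

```r
library(XML)

# Sample input mirroring the question's structure. The <DOCS> wrapper is an
# assumption added here so the text has a single root element.
txt <- "<DOCS>
  <DOC>
    <div> Document 1 - Element 1 </div>
    <div> Document 1 - Element 2 </div>
  </DOC>
  <DOC>
    <div> Document 2 - Element 1 </div>
    <div> Document 2 - Element 2 </div>
  </DOC>
</DOCS>"

doc <- xmlParse(txt, asText = TRUE)

# Outer loop: one iteration per DOC node.
# Inner call: ".//div" is relative to the current DOC, so only its own divs match.
result <- lapply(getNodeSet(doc, "//DOC"), function(node) {
  trimws(xpathSApply(node, ".//div", xmlValue))
})

print(result)
```

The leading dot in ".//div" is what keeps the inner query scoped to the current DOC; a plain "//div" would search the whole document on every iteration.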

Answer 1: (score: 0)

Something like this:

dat <- c("Document 1 - Element 1",
 "Document 1 - Element 2",
 "Document 1 - Element 3",
 "Document 2 - Element 1",
 "Document 2 - Element 2",
 "Document 2 - Element 3")

split(dat, sapply(strsplit(dat, " - " ), "[", 1))

## $`Document 1`
## [1] "Document 1 - Element 1"
## [2] "Document 1 - Element 2"
## [3] "Document 1 - Element 3"
## 
## $`Document 2`
## [1] "Document 2 - Element 1"
## [2] "Document 2 - Element 2"
## [3] "Document 2 - Element 3"
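A small variation on the split approach above, offered as a sketch rather than a fix: split orders its groups by sorted key, so with ten or more documents "Document 10" would land before "Document 2". Assuming every string really has the form "<document> - <element>", the key order can be pinned to order of first appearance, and the names dropped to get the unnamed list asked for in the question. The names `keys`, `groups` and `result` are just illustrative.

```r
dat <- c("Document 1 - Element 1",
         "Document 1 - Element 2",
         "Document 1 - Element 3",
         "Document 2 - Element 1",
         "Document 2 - Element 2",
         "Document 2 - Element 3")

# Key = the text before the first " - "; fixed = TRUE treats the separator literally
keys <- vapply(strsplit(dat, " - ", fixed = TRUE), `[`, character(1), 1)

# Fix the level order to order of first appearance instead of alphabetical order
groups <- split(dat, factor(keys, levels = unique(keys)))

# Drop the names to match the [[1]], [[2]] shape from the question
result <- unname(groups)

print(result)
```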

Answer 2: (score: 0)

Here is another possibility. We can pass readHTMLList as the fun argument to getNodeSet:

library(XML)
getNodeSet(xmlParseString(txt), "//DOC", fun = readHTMLList)
#[[1]]
#[1] "Document 1 - Element 1" "Document 1 - Element 2" "Document 1 - Element 3"
#
#[[2]]
#[1] "Document 2 - Element 1" "Document 2 - Element 2" "Document 2 - Element 3"

Alternatively, we could also try

lapply(xmlParseString(txt)["DOC"], readHTMLList)
# $DOC
# [1] "Document 1 - Element 1" "Document 1 - Element 2"
# [3] "Document 1 - Element 3"
# 
# $DOC
# [1] "Document 2 - Element 1" "Document 2 - Element 2"
# [3] "Document 2 - Element 3"

where txt is

txt <- "<DOC>\n       <div>\n           Document 1 - Element 1\n       </div>\n\n       <div>\n           Document 1 - Element 2\n       </div>\n\n       <div>\n           Document 1 - Element 3\n       </div>\n\n    </DOC>\n\n    <DOC>\n       <div>\n           Document 2 - Element 1\n       </div>\n\n       <div>\n           Document 2 - Element 2\n       </div>\n\n       <div>\n           Document 2 - Element 3\n       </div>\n\n    </DOC>"

From the URL you gave, I get the following result

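library(RCurl)
# URL of the sample file linked in the question
url <- "https://raw.githubusercontent.com/sytpp/sample-files/master/data_3.html"
content <- getURL(url)
doc <- htmlTreeParse(content, useInternalNodes = TRUE)
# Extract the trimmed text of every div in the document
values <- getNodeSet(doc, "//div", fun = xmlValue, trim = TRUE)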
str(values[1:6])
# List of 6
# $ : chr "1 of 3 DOCUMENTS"
# $ : chr "The Daily Telegraph (London)"
# $ : chr "November 1, 2014 Saturday  Edition 1; National Edition"
# $ : chr "THE WEEK IN WESTMINSTER"
# $ : chr "SECTION: FEATURES; Pg. 26"
# $ : chr "LENGTH: 500 words"
length(values)
#[1] 39
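If the real file only yields a flat sequence of div texts like this, one possible way to regroup them per document is to treat the "1 of 3 DOCUMENTS" style line as a boundary marker. The sketch below is an assumption-heavy illustration, not something verified against the sample file: it assumes every document starts with such a marker, and apart from the first two strings the values are made-up placeholders. On the real result, `values` is a list, so `unlist(values)` would be needed first.

```r
# Illustrative values only: the first two strings come from the output above,
# the rest are made-up placeholders standing in for further div texts.
values <- c("1 of 3 DOCUMENTS",
            "The Daily Telegraph (London)",
            "2 of 3 DOCUMENTS",
            "placeholder text A",
            "3 of 3 DOCUMENTS",
            "placeholder text B")

# TRUE wherever a new document starts (assumed marker format "<n> of <m> DOCUMENTS")
is_start <- grepl("^[0-9]+ of [0-9]+ DOCUMENTS$", values)

# cumsum() turns the start flags into a running document index
doc_id <- cumsum(is_start)

# One character vector per document, without names
by_doc <- unname(split(values, doc_id))

print(by_doc)
```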