I am using the XML library in R and would like to split my HTML into chunks.
library(XML)
# Parse into an internal document so that XPath queries can be run on it
myHTML <- htmlTreeParse("myHTMLfile.HTML", useInternalNodes = TRUE)
unlist(xpathApply(myHTML, '//div', xmlValue))
This works well and gives me one long character vector. Ideally, however, I would like to split my HTML into chunks. The HTML is structured as follows:
<DOC>
<div>
Document 1 - Element 1
</div>
<div>
Document 1 - Element 2
</div>
<div>
Document 1 - Element 3
</div>
</DOC>
<DOC>
<div>
Document 2 - Element 1
</div>
<div>
Document 2 - Element 2
</div>
<div>
Document 2 - Element 3
</div>
</DOC>
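So I would like a list in which each element corresponds to one DOC, and each of those elements is a character vector containing elements 1, 2 and 3 of that DOC.
I am having trouble (A) even querying 'DOC' (because it is not part of the namespace?) and (B) getting this list-of-character-vectors output.
So instead of this output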
[1] "Document 1 - Element 1"
[2] "Document 1 - Element 2"
[3] "Document 1 - Element 3"
[4] "Document 2 - Element 1"
[5] "Document 2 - Element 2"
[6] "Document 2 - Element 3"
I would like to get this:
[[1]]
[1] "Document 1 - Element 1"
[2] "Document 1 - Element 2"
[3] "Document 1 - Element 3"
[[2]]
[1] "Document 2 - Element 1"
[2] "Document 2 - Element 2"
[3] "Document 2 - Element 3"
Thank you very much for your help!
Here is a sample of the HTML file I want to process:
https://raw.githubusercontent.com/sytpp/sample-files/master/data_3.html
Answer 0 (score: 2)
How about this.
library(XML)
dd<-xmlInternalTreeParse("<DOCS><DOC>
<div>Document 1 - Element 1</div>
<div>Document 1 - Element 2</div>
<div>Document 1 - Element 3</div>
</DOC><DOC>
<div>Document 2 - Element 1</div>
<div>Document 2 - Element 2</div>
<div>Document 2 - Element 3</div>
</DOC></DOCS>")
xmlApply(dd["//DOC"], function(x) xpathSApply(x,".//div", xmlValue))
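We find all the DOC elements and then look up all the divs inside each DOC. So we combine an outer xmlApply, which walks over the DOC elements, with an inner xpathSApply, which extracts the text from the div elements of each one.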
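The outer/inner pattern described above can also be written with getNodeSet and lapply. This is only a minimal sketch under assumptions: the input string txt and the wrapping DOCS root are hypothetical sample data modelled on the question, not the real file. The leading dot in ".//div" is what keeps the inner query relative to the current DOC node; without it, "//div" would search the whole document again.

```r
library(XML)

# Hypothetical sample input mirroring the structure shown in the question
txt <- "<DOCS>
  <DOC><div>Document 1 - Element 1</div><div>Document 1 - Element 2</div></DOC>
  <DOC><div>Document 2 - Element 1</div><div>Document 2 - Element 2</div></DOC>
</DOCS>"

doc <- xmlParse(txt, asText = TRUE)

# Outer step: one node per DOC element
doc_nodes <- getNodeSet(doc, "//DOC")

# Inner step: relative XPath (note the leading dot) run on each DOC node
result <- lapply(doc_nodes, function(node) {
  xpathSApply(node, ".//div", xmlValue)
})

str(result)
```

Each element of result should then be a character vector holding the div texts of one DOC.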
Answer 1 (score: 0)
Something like this:
dat <- c("Document 1 - Element 1",
"Document 1 - Element 2",
"Document 1 - Element 3",
"Document 2 - Element 1",
"Document 2 - Element 2",
"Document 2 - Element 3")
split(dat, sapply(strsplit(dat, " - " ), "[", 1))
## $`Document 1`
## [1] "Document 1 - Element 1"
## [2] "Document 1 - Element 2"
## [3] "Document 1 - Element 3"
##
## $`Document 2`
## [1] "Document 2 - Element 1"
## [2] "Document 2 - Element 2"
## [3] "Document 2 - Element 3"
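One thing to watch with this string-based approach: split() orders the groups by factor level, so with more documents a label such as "Document 10" may end up before "Document 2". A possible variation, sketched here with hypothetical data, builds the grouping key with sub() and fixes the level order to the order of first appearance.

```r
# Hypothetical input; "Document 10" is included to show the ordering issue
dat <- c("Document 1 - Element 1",
         "Document 1 - Element 2",
         "Document 2 - Element 1",
         "Document 10 - Element 1")

# Everything before the first " - " is used as the grouping key
key <- sub(" - .*$", "", dat)

# Keep the groups in order of first appearance instead of sorted order
groups <- split(dat, factor(key, levels = unique(key)))

names(groups)
```

This still relies on the text of each element carrying the document label, which is an assumption that holds for the toy data but not necessarily for a real file.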
Answer 2 (score: 0)
Here is another possibility. We can pass readHTMLList as the function to call in getNodeSet
library(XML)
getNodeSet(xmlParseString(txt), "//DOC", fun = readHTMLList)
#[[1]]
#[1] "Document 1 - Element 1" "Document 1 - Element 2" "Document 1 - Element 3"
#
#[[2]]
#[1] "Document 2 - Element 1" "Document 2 - Element 2" "Document 2 - Element 3"
Or we could also try
lapply(xmlParseString(txt)["DOC"], readHTMLList)
# $DOC
# [1] "Document 1 - Element 1" "Document 1 - Element 2"
# [3] "Document 1 - Element 3"
#
# $DOC
# [1] "Document 2 - Element 1" "Document 2 - Element 2"
# [3] "Document 2 - Element 3"
where txt is
txt <- "<DOC>\n <div>\n Document 1 - Element 1\n </div>\n\n <div>\n Document 1 - Element 2\n </div>\n\n <div>\n Document 1 - Element 3\n </div>\n\n </DOC>\n\n <DOC>\n <div>\n Document 2 - Element 1\n </div>\n\n <div>\n Document 2 - Element 2\n </div>\n\n <div>\n Document 2 - Element 3\n </div>\n\n </DOC>"
From the URL you gave, I got the following result
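library(RCurl)
# URL of the sample file linked in the question
url <- "https://raw.githubusercontent.com/sytpp/sample-files/master/data_3.html"
content <- getURL(url)
doc <- htmlTreeParse(content, useInternalNodes = TRUE)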
values <- getNodeSet(doc, "//div", fun = xmlValue, trim = TRUE)
str(values[1:6])
# List of 6
# $ : chr "1 of 3 DOCUMENTS"
# $ : chr "The Daily Telegraph (London)"
# $ : chr "November 1, 2014 Saturday Edition 1; National Edition"
# $ : chr "THE WEEK IN WESTMINSTER"
# $ : chr "SECTION: FEATURES; Pg. 26"
# $ : chr "LENGTH: 500 words"
length(values)
#[1] 39
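The result above is still one flat list of 39 div texts. If, and this is an assumption based only on the first value shown, every document in the file begins with a div of the form "N of M DOCUMENTS", the flat vector could be cut into one chunk per document with grepl and cumsum. The values vector below is hypothetical stand-in data, not the real file contents.

```r
# Hypothetical stand-in for unlist(values) from the real file
values <- c("1 of 3 DOCUMENTS",
            "The Daily Telegraph (London)",
            "LENGTH: 500 words",
            "2 of 3 DOCUMENTS",
            "Some other source",
            "3 of 3 DOCUMENTS",
            "Yet another source")

# TRUE wherever a new document starts (assumed marker format)
starts <- grepl("^[0-9]+ of [0-9]+ DOCUMENTS$", values)

# cumsum turns the start flags into a running document index
chunks <- unname(split(values, cumsum(starts)))

str(chunks)
```

If the real file marks document boundaries differently, only the pattern passed to grepl would need to change.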