Question

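I am trying to use R's XML package to extract some information from a website.

I have downloaded a web page, and I use an XPath expression to extract the relevant elements from it. This typically yields about 50 elements. I would then like to save these elements (an XMLNodeSet) as an XML document so that I can analyse the results in an XML editor.

However, before I can save the XMLNodeSet, I need to convert it into a well-formed XML document that XML::saveXML() can write out.

Does anyone have any idea how to do this with R and its XML package? Here is the code so far: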

library(XML)
download.file("https://www.holidayhouses.co.nz/Browse/List.aspx?page=37", "data.html")
doc <- htmlParse("data.html")
# set up x-path
str_x_path_lccg <- "//div[@class = 'ListCard-content group']"
# extract relevant nodes
xml_relevant_nodes <- XML::getNodeSet(doc, str_x_path_lccg)
# need to convert xml_relevant_nodes into a well-formed xml document in order to save it
# therefore the following fails
XML::saveXML(xml_relevant_nodes, "test.xml")

Any ideas?

Answer 1

Since asking this question, I have learned more about R and its XML package. Here is an answer to the question as originally asked:

library(XML)
download.file("https://www.holidayhouses.co.nz/Browse/List.aspx?page=37", "data.html")
doc <- htmlParse("data.html")
# set up x-path
str_x_path_lccg <- "//div[@class = 'ListCard-content group']"
# extract relevant nodes
xml_relevant_nodes <- XML::getNodeSet(doc, str_x_path_lccg)
# need to convert xml_relevant_nodes into a well-formed xml document in order to save it
# first, create a single node to act as the parent ("topNode" becomes its text content)
xmlDoc <- newXMLNode("top", "topNode", namespace = c(tfm = "http://www.thefactmachine.com"))
# now we can add the node set to the parent node
addChildren(xmlDoc, kids = xml_relevant_nodes)
XML::saveXML(xmlDoc, "test.xml")
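
For anyone who wants to try the approach without downloading the page, here is a minimal sketch that runs on an inline HTML string. The sample markup, the root element name "listings" and the temporary output file are all made up for illustration. It also wraps copies of the nodes (via xmlClone) rather than the nodes themselves, which is my own variation on the answer above, so that the parsed page is left untouched.

```r
library(XML)

# Inline HTML stands in for the downloaded page (hypothetical sample data).
html <- paste0(
  "<html><body>",
  "<div class='ListCard-content group'><p>First listing</p></div>",
  "<div class='ListCard-content group'><p>Second listing</p></div>",
  "<div class='other'><p>Ignored</p></div>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Same XPath expression as in the question.
nodes <- getNodeSet(doc, "//div[@class = 'ListCard-content group']")

# Create a single root element and attach copies of the matched nodes to it.
# "listings" is an arbitrary name chosen for this sketch.
root <- newXMLNode("listings")
invisible(addChildren(root, kids = lapply(nodes, xmlClone)))

# Write the wrapped nodes to a temporary file.
out_file <- tempfile(fileext = ".xml")
saveXML(root, file = out_file)

# Re-parse the file to confirm that it is well-formed XML, then count
# the div elements directly under the root.
check <- xmlParse(out_file)
n_saved <- length(getNodeSet(check, "/listings/div"))
print(n_saved)
```

If xmlParse() reads the file back without an error, the output is well-formed and should open in an XML editor.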

Converting an XMLNodeSet into a well-formed XML document
