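I am using R's XML package to extract all possible data from a variety of HTML and XML files. These files are mostly documentation, build properties, or readme files. Here is a sample: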
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE chapter PUBLIC '-//OASIS//DTD DocBook XML V4.1.2//EN'
'http://www.oasis-open.org/docbook/xml/4.0 docbookx.dtd'>
<chapter lang="en">
<chapterinfo>
<author>
<firstname>Jirka</firstname>
<surname>Kosek</surname>
</author>
<copyright>
<year>2001</year>
<holder>Jiří Kosek</holder>
</copyright>
<releaseinfo>$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp $</releaseinfo>
</chapterinfo>
<title>Using XSL stylesheets to generate HTML Help</title>
<?dbhtml filename="htmlhelp.html"?>
<para>HTML Help (HH) is help-format used in newer versions of MS
Windows and applications written for this platform. This format allows
to pack several HTML files together with images, table of contents and
index into single file. Windows contains browser for this file-format
and full-text search is also supported on HH files. If you want know
more about HH and its capabilities look at <ulink
url="http://msdn.microsoft.com/library/tools/htmlhelp/chm/HH1Start.htm">HTML
Help pages</ulink>.</para>
<section>
<title>How to generate first HTML Help file from DocBook sources</title>
<para>Working with HH stylesheets is same as with other XSL DocBook
stylesheets. Simply run your favorite XSLT processor on your document
with stylesheet suited for HH:</para>
</section>
</chapter>
My goal is to parse the tree with htmlTreeParse or xmlTreeParse and then call xmlValue on the result, using something like this (for XML files):
Text = xmlValue(xmlRoot(xmlTreeParse(XMLFileName)))
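However, when I do this for both XML and HTML files, I run into a problem: if a node has child nodes two or more levels deep, their text fields are pasted together without any whitespace between them.
For example, in the sample above, xmlValue(chapterInfo) is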
JirkaKosek2001JiKosek$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp
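The xmlValues of all child nodes are pasted together (recursively) with no whitespace added between them. How can I get xmlValue to add whitespace when extracting this data?
Thank you very much for your help,
Shivani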
Answer 0 (score: 3)
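According to the documentation, xmlValue only works on a single text node, or on "XML nodes that contain a single text node".
Whitespace in non-text nodes is apparently not preserved.
However, even for a single text node, your code strips the whitespace.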
library(XML)
doc <- xmlTreeParse("<a> </a>")
xmlValue(xmlRoot(doc))
# [1] ""
You can add the ignoreBlanks=FALSE and useInternalNodes=TRUE arguments to xmlTreeParse to preserve all whitespace.
doc <- xmlTreeParse(
"<a> </a>",
ignoreBlanks = FALSE,
useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] " "
# Spaces inside text nodes are preserved
doc <- xmlTreeParse(
"<a>foo <b>bar</b></a>",
ignoreBlanks = FALSE,
useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foo bar"
# Spaces between text nodes (inside non-text nodes) are preserved as well
doc <- xmlTreeParse(
"<a><b>foo</b> <b>bar</b></a>",
ignoreBlanks = FALSE,
useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foo bar"
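If the source has no whitespace at all between sibling elements, or you want a fixed separator regardless of how the file is laid out, one option is to collect the text nodes yourself and join them. The following is only a sketch: text_with_sep is a hypothetical helper name, not part of the XML package, and it assumes the document was parsed into internal nodes so that XPath queries work.

```r
library(XML)

# Hypothetical helper (not part of the XML package):
# gather every descendant text node, trim it, drop blank pieces,
# and join what is left with an explicit separator.
text_with_sep <- function(node, sep = " ") {
  pieces <- unlist(xpathSApply(node, ".//text()", xmlValue))
  pieces <- trimws(pieces)
  paste(pieces[nzchar(pieces)], collapse = sep)
}

# No whitespace between the two <b> elements in the source
doc <- xmlParse("<a><b>foo</b><b>bar</b></a>", asText = TRUE)
result <- text_with_sep(xmlRoot(doc))
result
# [1] "foo bar"
```

Because the separator is added by paste rather than taken from the file, this does not rely on ignoreBlanks. The trade-off is that inline markup is split as well, so text such as "foo<b>bar</b>" would also come out as "foo bar".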