Question

I have an XML file that looks like this:

<?xml version="1.0"?>
<Book>
  <Title>Ulysses</Title>
  <Author>James <b>Joyce</b></Author>
</Book>

I need to parse it with Java into a POJO like this:

title="Ulysses"
author="James <b>Joyce</b>"

In other words, when parsing I need to keep HTML, or possibly custom XML tags, as plain text rather than as XML elements.

I cannot edit the XML at all, but I can create a custom XSLT file to transform it.

I have the following Java code, which uses XSLT to help read the XML:

TransformerFactory factory = TransformerFactory.newInstance();
Source stylesheetSource = new StreamSource(new File(stylesheetPathname).getAbsoluteFile());
Transformer transformer = factory.newTransformer(stylesheetSource);
Source inputSource = new StreamSource(new File(inputPathname).getAbsoluteFile());
Result outputResult = new StreamResult(new File(outputPathname).getAbsoluteFile());
transformer.transform(inputSource, outputResult);

This does apply my XSLT and write the file out, but I cannot come up with the right XSLT to do this. I looked at "Add CDATA to an xml file", but it did not work for me.

Basically, I believe I want the file to look like this:

<?xml version="1.0"?>
<Book>
  <Title>Ulysses</Title>
  <Author><![CDATA[James <b>Joyce</b>]]></Author>
</Book>

Then I could extract "James <b>Joyce</b>".
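To illustrate the extraction step the question has in mind, here is a minimal sketch using only the JDK's DOM parser. It assumes the document has already been rewritten into the CDATA form shown above; the class name CdataExtract and the helper authorText are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataExtract {

    // Returns the text content of the first <Author> element.
    // A CDATA section is treated as character data, so the markup inside it
    // comes back as a plain string rather than as child elements.
    static String authorText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("Author").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<Book>\n"
                + "  <Title>Ulysses</Title>\n"
                + "  <Author><![CDATA[James <b>Joyce</b>]]></Author>\n"
                + "</Book>";
        System.out.println(authorText(xml));
    }
}
```

The same call on the original, unwrapped document would return only the text of the descendants, with the b tags dropped, which is why the CDATA wrapping matters here.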

I used the following XSLT:

 <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" omit-xml-declaration="no"/>

<xsl:template match="Author">
<xsl:copy>
<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:copy-of select="*"/>    
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:copy>
</xsl:template>

This produced:

<?xml version="1.0" encoding="UTF-8"?>
  Ulysses
  <Author><![CDATA[
<b>Joyce</b>]]></Author>

Can you help? I want the original document written out in full, but with a CDATA section wrapping everything inside the Author element. Thanks.

Answer 1

With XSLT 3.0, which Saxon 9.8 HE supports (available on Maven and SourceForge), you can use the following XSLT:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">

    <xsl:output cdata-section-elements="Author"/>

    <xsl:mode on-no-match="shallow-copy"/>

    <xsl:template match="Author">
        <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:value-of select="serialize(node())"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

As for your attempt, you basically need to "implement" the identity transformation, which XSLT 3.0 expresses concisely as <xsl:mode on-no-match="shallow-copy"/>, as the template

<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

in XSLT 1.0, so that nodes not handled by a more specialized template (such as the one for the Author element) are copied recursively.
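The effect of the identity template can be seen with a small sketch that runs two stylesheets through the JDK's built-in XSLT 1.0 processor. Without the identity template, the built-in rules emit only text for unmatched elements, which is why the output in the question shows a bare "Ulysses". The class name IdentityDemo, its constants, and the simplified Author template are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IdentityDemo {

    static final String XML = "<?xml version=\"1.0\"?>\n"
            + "<Book>\n"
            + "  <Title>Ulysses</Title>\n"
            + "  <Author>James <b>Joyce</b></Author>\n"
            + "</Book>";

    static final String HEADER =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\"/>";

    // Simplified stand-in for a specialized template: copies Author unchanged.
    static final String AUTHOR_TEMPLATE =
            "<xsl:template match=\"Author\">"
            + "<xsl:copy><xsl:copy-of select=\"node()\"/></xsl:copy>"
            + "</xsl:template>";

    // The XSLT 1.0 identity template from the answer.
    static final String IDENTITY_TEMPLATE =
            "<xsl:template match=\"@* | node()\">"
            + "<xsl:copy><xsl:apply-templates select=\"@* | node()\"/></xsl:copy>"
            + "</xsl:template>";

    static final String FOOTER = "</xsl:stylesheet>";

    // Applies the stylesheet to the XML, both given as strings.
    static String transform(String xslt, String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Only the Author template: Book and Title tags are lost.
        System.out.println(transform(HEADER + AUTHOR_TEMPLATE + FOOTER, XML));
        // With the identity template: the whole document is copied.
        System.out.println(transform(HEADER + IDENTITY_TEMPLATE + AUTHOR_TEMPLATE + FOOTER, XML));
    }
}
```

The Author template still wins for Author elements in the second run because a name test has a higher default priority than node().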

Then, by copying all child nodes with node() rather than only the element nodes with *, you get

<xsl:template match="Author">
<xsl:copy>
<xsl:apply-templates select="@*"/>
<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>
<xsl:copy-of select="node()"/>    
<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>
</xsl:copy>
</xsl:template>
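Putting the two templates together, the following sketch applies the XSLT 1.0 variant from Java using in-memory strings instead of files. It assumes the JDK's built-in XSLT processor, which honors disable-output-escaping when writing to a stream; the XSLT 1.0 specification makes that feature optional, so other processors may behave differently. The class name CdataWrapDemo and the method wrapAuthor are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class CdataWrapDemo {

    // Identity template plus the Author template from the answer.
    static final String STYLESHEET =
            "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\"/>"
            + "<xsl:template match=\"@* | node()\">"
            + "<xsl:copy><xsl:apply-templates select=\"@* | node()\"/></xsl:copy>"
            + "</xsl:template>"
            + "<xsl:template match=\"Author\">"
            + "<xsl:copy>"
            + "<xsl:apply-templates select=\"@*\"/>"
            + "<xsl:text disable-output-escaping=\"yes\">&lt;![CDATA[</xsl:text>"
            + "<xsl:copy-of select=\"node()\"/>"
            + "<xsl:text disable-output-escaping=\"yes\">]]&gt;</xsl:text>"
            + "</xsl:copy>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Returns the input document with the content of Author wrapped in CDATA.
    static String wrapAuthor(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        // The result must be serialized (stream or writer) for the
        // disable-output-escaping markers to take effect.
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<Book>\n"
                + "  <Title>Ulysses</Title>\n"
                + "  <Author>James <b>Joyce</b></Author>\n"
                + "</Book>";
        System.out.println(wrapAuthor(xml));
    }
}
```

One caveat of building the CDATA section by hand: if the Author content ever contained the sequence ]]>, the output would no longer be well-formed, which is a case the cdata-section-elements approach in the XSLT 3.0 version handles for you.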

Answer 2

Wouldn't a simple HTML/XML parser such as Jsoup be a better way to solve this? With Jsoup, you could try something like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class Example {

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<Book>\n"
                + "  <Title>Ulysses</Title>\n"
                + "  <Author>James <b>Joyce</b></Author>\n"
                + "</Book>";
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);
        Elements books = doc.select("Book");
        for(Element e: books){
            Book b = new Book(e.select("Title").html(),e.select("Author").html());
            System.out.println(b.title);
            System.out.println(b.author);
        }
    }
    public static class Book{
        String title;
        String author;

        public Book(String title, String author) {
            this.title = title;
            this.author = author;
        }        
    }
}

