Declare ENTITY to define nbsp as the string "&nbsp;"

Time: 2016-05-04 16:01:25

Tags: java html xml xslt

I have an HTML document that I need to transform via XSL. The HTML document contains uses of &nbsp;, for example:

ation.</span>&nbsp;</p><br/>All ...

At first I ran into trouble because &nbsp; was not defined. So I defined it:

<?xml version="1.0"?>
<!DOCTYPE html [
    <!ENTITY nbsp "&#160;">
]>

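I did this by prepending that code to the HTML string before sending it to the transformation. After the transformation the ENTITY declaration is conveniently gone, and yes, the transformation now succeeds.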

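However, because nbsp is defined as a space, the string "&nbsp;" in the resulting HTML/XML is actually replaced by a space character.

This is not what I want. I need this part of the result to be identical to the source.

So I tried redefining it, like this: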

<?xml version="1.0"?>
<!DOCTYPE html [
    <!ENTITY nbsp "&amp;nbsp;">
]>

But now, instead of a space, I see the characters "&amp;nbsp;" in the result.

If I try this:

<?xml version="1.0"?>
<!DOCTYPE html [
    <!ENTITY nbsp "&nbsp;">
]>

I get a recursive declaration exception.
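The three attempts above can be checked at the parser level, before any XSLT is involved. The following is only a minimal sketch and not part of the original question: the class name `EntityValueDemo` and the helpers `parsedText` and `codePoints` are made up for illustration, and it assumes the JDK's default DOM parser. It declares nbsp with a given value, parses `<p>&nbsp;</p>`, and reports which characters the parser actually hands on.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityValueDemo {

    /**
     * Parses <p>&nbsp;</p> with nbsp declared as entityValue and returns the
     * text content the parser produces. Parser failures are rethrown as
     * IllegalStateException (hypothetical helper, for illustration only).
     */
    static String parsedText(String entityValue) {
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE p [<!ENTITY nbsp \"" + entityValue + "\">]>"
                + "<p>&nbsp;</p>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Renders each char as U+XXXX so invisible characters can be told apart. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String[] values = {"&#160;", "&amp;nbsp;", "&nbsp;", "&#38;nbsp;"};
        for (String value : values) {
            try {
                System.out.println(value + " -> " + codePoints(parsedText(value)));
            } catch (IllegalStateException e) {
                System.out.println(value + " -> parse error: "
                        + e.getCause().getMessage());
            }
        }
    }
}
```

With "&#160;" the parser yields the single character U+00A0, which is a non-breaking space rather than a regular U+0020 space. With "&amp;nbsp;" it yields the six literal characters of "&nbsp;", whose ampersand an XML serializer then has to escape again. Character references inside an entity value are expanded when the declaration is read, so "&#38;nbsp;" ends up as "&nbsp;" and fails in the same way as the directly recursive declaration.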

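How can I include the special character '&' in the definition?

P.S. I run this transformation in Java 8 with the default engine (I guess that is Xalan?).

Thanks, everyone!

Here is a short example of how to reproduce the problem. Sorry for not providing it earlier.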

package com.astraia.app.mainframe;

import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ShortExample
{
    public static void main(String[] args)
    {
        StringBuffer htmlMain = new StringBuffer(500);
        htmlMain    .append("<html><head></head>")
                    .append("   <body>)")
                    .append("       <p data-tags=\"personal\"><strong>name: Nerea Morry,  Id: 5678</strong><br/></p>")
                    .append("       <p><span>some text</span>&nbsp;</p><br/>some more text")
                    .append("   </body>")
                    .append("</html>");

        StringBuffer xsl = new StringBuffer(500);
        xsl .append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>")
            .append("<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">")
            .append("   <xsl:output method=\"xml\" version=\"1.0\" encoding=\"UTF-8\" omit-xml-declaration=\"yes\" />")
            .append("   <xsl:template match=\"node()|@*\" >")
            .append("       <!-- Copy all nodes -->")
            .append("       <xsl:copy>")
            .append("             <xsl:apply-templates select=\"node()|@*\" />")
            .append("       </xsl:copy>")
            .append("   </xsl:template>")
            .append("   <!-- Anonymize all text within tags indicated as personal -->")
            .append("   <xsl:template match=\"*[@data-tags = 'personal' ]//text()[normalize-space(.) != '']\">ANONYMIZED TEXT</xsl:template>")
            .append("   </xsl:stylesheet>");

        String plainHtml = htmlMain.toString();
        String transformation = xsl.toString();

        // results in &nbsp being replaced by a space
        printResult("results in &nbsp being replaced by a space", plainHtml,"&#160;", transformation);
        // results in seemingly non-replaced escape code &amp;
        printResult("results in seemingly non-replaced escape code &amp;", plainHtml,"&amp;nbsp", transformation);
        // results in recursion exception
        printResult("results in recursion exception", plainHtml,"&nbsp;", transformation);
        // also results in recursion exception
        printResult("also results in recursion exception", plainHtml,"&#038;nbsp;", transformation);

        // but what will result in:
        // <html><head/>    <body>)     <p data-tags="personal"><strong>ANONYMIZED TEXT</strong><br/></p>       <p><span>some text</span>&nbsp</p><br/>some more text   </body></html>
        // ?
    }

    public static void printResult(String message, String plainHtml, String definition, String transformation) {
        System.out.print(message);
        System.out.println(performTransformation(plainHtml,definition, transformation));
        System.out.println("\n-----");
    }

    public static String performTransformation(String plainHtml, String definition, String transformation)
    {
        String retval = null;

        try {
            StringWriter result = new StringWriter();
            StringBuffer header = new StringBuffer(100);
            header  .append("<?xml version=\"1.0\"?>")
                    .append("<!DOCTYPE html [")
                    .append("    <!ENTITY nbsp REPLACE_ME>")
                    .append("]>\n");

            String headerText = header.toString().replace("REPLACE_ME", "\"" + definition + "\"");
            String wholeText = new StringBuffer(headerText).append(plainHtml).toString();

            TransformerFactory factory = TransformerFactory.newInstance();
            Source xslt = new StreamSource(new StringReader(transformation));
            Transformer transformer = factory.newTransformer(xslt);
            Source text = new StreamSource(new StringReader(wholeText));
            transformer.transform(text, new StreamResult(result));
            retval = result.toString();
        }
        catch (Exception e) {
            System.out.println(e.getMessage());
        }

        return retval;
    }
}

Here is the output of my small sample application:

results in &nbsp being replaced by a space<html><head/> <body>)     <p data-tags="personal"><strong>ANONYMIZED TEXT</strong><br/></p>       <p><span>some text</span> </p><br/>some more text   </body></html>

-----
results in seemingly non-replaced escape code &amp;<html><head/>    <body>)     <p data-tags="personal"><strong>ANONYMIZED TEXT</strong><br/></p>       <p><span>some text</span>&amp;nbsp</p><br/>some more text   </body></html>

-----
results in recursion exceptionjavax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Recursive entity reference "nbsp". (Reference path: nbsp -> nbsp -> nbsp),
null
ERROR:  'Recursive entity reference "nbsp". (Reference path: nbsp -> nbsp -> nbsp),'
-----
ERROR:  'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Recursive entity reference "nbsp". (Reference path: nbsp -> nbsp -> nbsp),'

also results in recursion exceptionERROR:  'Recursive entity reference "nbsp". (Reference path: nbsp -> nbsp -> nbsp),'
ERROR:  'com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Recursive entity reference "nbsp". (Reference path: nbsp -> nbsp -> nbsp),'
javax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Recursive entity reference "nbsp". (Reference path: nbsp -> nbsp -> nbsp),
null

-----

The difference between the 4 attempts is:

</span> </p><br/>some more text

</span>&amp;nbsp</p><br/>some more text

exception

exception

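1 Answer:

Answer 0 (score: 1)

I believe you have two options:

  1. Change the output method to html;
    this will output any non-breaking space as &nbsp;

  2. Change the output encoding to ASCII;
    this will output any non-breaking space as &#160;

Note: if you keep the output method as xml and the encoding as UTF-8, the serialized result should still contain the unescaped non-breaking space. Something else in your processing chain may be preventing this, or you may be mistaking the character for a regular space (after all, they render the same in most cases).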
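The options in this answer can be tried side by side with a sketch like the one below. It is an illustration under assumptions, not the answerer's code: the class `NbspOutputDemo`, its `transform` and `visible` helpers, and the shortened input document are hypothetical, and instead of editing `xsl:output` it overrides the method and encoding through `Transformer.setOutputProperty`. The result is written to a byte stream so that the chosen encoding is really applied during serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class NbspOutputDemo {

    // Shortened input: nbsp is declared as the character reference &#160;.
    private static final String INPUT =
            "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE html [<!ENTITY nbsp \"&#160;\">]>"
            + "<html><body><p><span>some text</span>&nbsp;</p></body></html>";

    // Plain identity transform with the same xsl:output as in the question.
    private static final String IDENTITY_XSL =
            "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
            + " version=\"1.0\">"
            + "<xsl:output method=\"xml\" encoding=\"UTF-8\""
            + " omit-xml-declaration=\"yes\"/>"
            + "<xsl:template match=\"node()|@*\">"
            + "<xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    /** Runs the identity transform with the given output method and encoding. */
    static String transform(String method, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(IDENTITY_XSL)));
            // These settings override the xsl:output attributes of the stylesheet.
            transformer.setOutputProperty(OutputKeys.METHOD, method);
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new StreamSource(new StringReader(INPUT)),
                    new StreamResult(out));
            return out.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Makes a raw non-breaking space visible when printing to a console. */
    static String visible(String s) {
        return s.replace("\u00A0", "[U+00A0]");
    }

    public static void main(String[] args) {
        System.out.println("html, UTF-8    : " + visible(transform("html", "UTF-8")));
        System.out.println("xml,  US-ASCII : " + visible(transform("xml", "US-ASCII")));
        System.out.println("xml,  UTF-8    : " + visible(transform("xml", "UTF-8")));
    }
}
```

The `visible` helper exists only because U+00A0 and an ordinary space look identical in most consoles, which is the confusion the note above describes. Exact serialization may differ between XSLT processors, so the output is worth checking on the processor actually in use.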