Question

This code was tested on Android:

    public static void test() {
        String text="<html>" +
        "<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=us-ascii\" />" +
        "<title>Testing Escapes with DOM</title>" +
        "</head>" +
        "<body lang=\"en\"><p>This is an escape: &mdash;</p></body>" +
        "</html>";

        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document inputDoc = builder.parse(new ByteArrayInputStream(text.getBytes()));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(inputDoc), new StreamResult(baos));

            System.out.println("Result: " + baos.toString());

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

This is the output:

    Result: <html>
    <head>
    <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Testing Escapes with DOM</title>
    </head>
    <body lang="en">
    <p>This is an escape: </p>
    </body>
    </html>

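(The real code does not simply copy the input to the output; it does some filtering in between.)

The question is about the &mdash;, which appears in the original text but not in the output.

When I inspect the Document created by the parser, it does contain an EntityReference node for &mdash;, but it looks like DOMSource wants to resolve every entity and skips any it cannot resolve.

Unlike XML, HTML does not accept ENTITY declarations, so the predefined entities are the only ones that can be recognized. For that reason I would simply like all entities to appear in the output "as is", without being resolved. Is there a way to do that? (Maybe an alternative to DOMSource + Transformer?)

Of course I could replace all the escapes with the actual UTF characters, and that would certainly work. But my text has a lot of escapes, and replacing all of them would be tedious. Besides, I would like to find a solution once and for all.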
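For reference, the character-replacement fallback mentioned above could be sketched roughly like this. It is only an illustration under assumptions: the class name `EntityExpander` and its four-entry table are made up for this example, and a real table would have to cover every named entity the text actually uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityExpander {
    // Hypothetical, deliberately tiny table (an assumption for this sketch).
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("mdash", "\u2014");
        ENTITIES.put("ndash", "\u2013");
        ENTITIES.put("nbsp", "\u00A0");
        ENTITIES.put("copy", "\u00A9");
    }

    // Matches named references such as &mdash; (numeric ones like &#8212; are not matched).
    private static final Pattern NAMED_ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    public static String expand(String html) {
        Matcher m = NAMED_ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                // Names missing from the table (including &amp;, &lt;, ...) stay untouched.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling `EntityExpander.expand(text)` before `builder.parse(...)` would hand the parser literal characters instead of references it cannot resolve. The output then contains the characters themselves rather than the original escapes, which is exactly why I would prefer a way to keep the entities untouched.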

Answer 1

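Here is a workaround that works: before parsing, replace every & with a "dumb sequence" (for example "###"), and do the reverse replacement once you are done. (By the way, this also works for HTML that contains "real" ampersands, written as &amp;.)

With this approach, the example above becomes the following: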

    public static void test() {
        String text = "<html>" +
        "<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=us-ascii\" />" +
        "<title>Testing Escapes with DOM</title>" +
        "</head>" +
        "<body lang=\"en\"><p>This is an escape: &mdash;</p></body>" +
        "</html>";

        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Hide every '&' from the parser so it never sees an entity reference.
            Document inputDoc = builder.parse(new ByteArrayInputStream(
                    text.replace("&", "###").getBytes()));

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(inputDoc), new StreamResult(baos));

            // Put the ampersands back after serialization.
            System.out.println("Result: " + baos.toString().replace("###", "&"));

        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

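It is a very awkward solution and a huge waste of memory (a real challenge for a garbage collector as poor as Dalvik's!), and I would definitely prefer something less hacky...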
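One more weak spot of the fixed "###" marker: it breaks if the text already contains "###", or if a '#' sits right before an ampersand ("#&" comes back as "&#"). A slightly safer variant, as a rough sketch, picks a marker that does not occur in the input and cannot overlap with itself. The class name `AmpersandGuard` and the "{{ampN}}" marker format are assumptions made up for this example, and it does nothing about the memory cost.

```java
public class AmpersandGuard {
    private final String placeholder;

    public AmpersandGuard(String text) {
        // Try {{amp0}}, {{amp1}}, ... until the marker does not occur in the input.
        // The marker starts with "{{" and ends with "}}", so no proper prefix of it
        // equals a suffix, and neighbouring characters cannot form a false match.
        int n = 0;
        String candidate = "{{amp" + n + "}}";
        while (text.contains(candidate)) {
            n++;
            candidate = "{{amp" + n + "}}";
        }
        this.placeholder = candidate;
    }

    // Hide every '&' behind the marker before the text reaches the parser.
    public String protect(String text) {
        return text.replace("&", placeholder);
    }

    // Turn the markers back into '&' after serialization.
    public String restore(String text) {
        return text.replace(placeholder, "&");
    }
}
```

In the example above, `new AmpersandGuard(text)` would be created once, with `guard.protect(text)` taking the place of `text.replace("&", "###")` and `guard.restore(baos.toString())` taking the place of the final `replace("###", "&")`.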
