Question

I have an XML org.w3c.dom.Document that was created from an HTML org.jsoup.nodes.Document.

When I serialize the org.w3c.dom.Document, it produces an invalid XML file: the META tag is not closed.

Why? Is this a bug? In jsoup? In Java's org.w3c.dom? In javax.xml.transform.Transformer?

Related issues:

W3CDom.fromJsoup fails when xmlns is defined https://github.com/jhy/jsoup/issues/1096
Should org.jsoup.nodes.Document.toString() generate a valid XML file? https://github.com/jhy/jsoup/issues/1097

Sample code:

import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

public class Test130e {
    public static void main(String[] args) throws Exception {
        String html = "<html><head><script async src=\"http://example.com/script.js\"></script></head></html>";

        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html); 
        System.out.println("+++ jsoupDoc.toString()");
        System.out.println(jsoupDoc.toString());


        Document w3cDoc = new W3CDom().fromJsoup(jsoupDoc);
        String xml = w3cDocToString(w3cDoc);

        System.out.println("+++ xml");
        System.out.println(xml);

        // The XML produced above is invalid, so parsing it fails with:
        // The element type "META" must be terminated by the matching end-tag "</META>".
        Document w3cDoc2 = parseXml(xml);
    }

    static Document parseXml(String content) throws Exception {
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return documentBuilder.parse(new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
    }

    private static String w3cDocToString(Document w3cDoc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        StreamResult result = new StreamResult(new StringWriter());
        DOMSource source = new DOMSource(w3cDoc);
        transformer.transform(source, result);
        return result.getWriter().toString();
    }

}

Output:

+++ jsoupDoc.toString()
<html>
 <head>
  <script async src="http://example.com/script.js"></script>
 </head>
 <body></body>
</html>

+++ xml
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script async="" src="http://example.com/script.js"></script>
</head>
<body></body>
</html>

[Fatal Error] :5:3: The element type "META" must be terminated by the matching end-tag "</META>".
Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 5; columnNumber: 3; The element type "META" must be terminated by the matching end-tag "</META>".
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
    at playground.Test130e.parseXml(Test130.java:116)
    at playground.Test130e.main(Test130.java:110)

Why does w3cDocToString generate an invalid XML file (it does not close the META tag)?

Is this a bug? In jsoup? In Java's org.w3c.dom?

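Update

Regarding @Alohci's comment:

Have you tried adding transformer.setOutputProperty(OutputKeys.METHOD, "xml"); to your transformer configuration?

Interesting! If I add it, the transformer's output no longer contains the META tag at all. Why?

Also, if I add the following line before that call, it reports that the method is already "xml". How strange!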

System.out.println(transformer.getOutputProperty(OutputKeys.METHOD));

Answer 1

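This is not a bug in org.w3c.dom, because org.w3c.dom does not render XML.

A DOM implementation can neither forget nor remember to close a tag, because it is only an in-memory representation of a structure (the "OM" in DOM stands for Object Model). That model can be converted to XML, JSON, Protocol Buffers, and so on, each of which has its own encoding. Whatever renders it as XML is the thing that "forgot" to close the tag.

You are using an implementation of the abstract class javax.xml.transform.Transformer to convert the DOM to XML, but the concrete class is unknown/unspecified. That looks like what is generating the bad XML. You may want to print transformer.getClass() to see the actual implementation: it depends on the environment settings, the service providers on the classpath, and so on.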
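A minimal sketch of that check, using only the standard JAXP API. The class name TransformerInfo and the helper describe() are made up for illustration, and the names printed will differ from one environment to another:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;

class TransformerInfo {
    // Returns the concrete factory and transformer class names that JAXP resolved.
    static String describe() throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer();
        return factory.getClass().getName() + " -> " + transformer.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        // The result depends on the JDK and on any XSLT libraries on the classpath.
        System.out.println(describe());
    }
}
```

If the printed classes come from a library on the classpath rather than from the JDK itself, that library is the one doing the serialization.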

Note: I had never heard of jsoup before.

Answer 2

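(I clarified my answer in response to a comment; that comment no longer applies to the answer in its current form.)

In HTML, the <meta> element is self-closing (a void element); it has no end tag.

You have built a DOM document, which is a tree of nodes whose topmost node is the HTML element.

You then serialized the DOM document with a JAXP serializer without specifying the output method. The default output method depends on the root element, which is HTML, so you get HTML serialization. The HTML serializer adds the unclosed META tag to the output.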
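The difference can be tried out with a small sketch that needs no jsoup. It builds a DOM whose root is html, serializes it once with the serializer's default method and once with the method set explicitly to "xml", and checks whether each result parses back as XML. The class OutputMethodDemo and its helper methods are hypothetical names, and what the default case produces depends on which Transformer implementation is on the classpath:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class OutputMethodDemo {
    // Builds a tiny DOM whose root element is <html>, without using jsoup.
    static Document buildHtmlDom() throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element html = doc.createElement("html");
        doc.appendChild(html);
        html.appendChild(doc.createElement("head"));
        html.appendChild(doc.createElement("body"));
        return doc;
    }

    // Serializes the DOM; a null method leaves the serializer's default in place.
    static String serialize(Document doc, String method) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        if (method != null) {
            transformer.setOutputProperty(OutputKeys.METHOD, method);
        }
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Returns true if the text can be parsed back as well-formed XML.
    static boolean isWellFormedXml(String text) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildHtmlDom();
        String byDefault = serialize(doc, null);
        String asXml = serialize(doc, "xml");
        System.out.println("default method, well-formed: " + isWellFormedXml(byDefault));
        System.out.println(byDefault);
        System.out.println("xml method, well-formed: " + isWellFormedXml(asXml));
        System.out.println(asXml);
    }
}
```

Setting the method explicitly removes the dependence on the root element's name, which is why the explicit "xml" output can be parsed again.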

Answer 3

HTML != XML.

Valid XML, but invalid HTML:

<script src="jacoco-resources/sort.js" type="text/javascript"/>

Valid HTML:

<script src="jacoco-resources/sort.js" type="text/javascript"></script>

So:

        Document template = ....

        // So, we need to generate the HTML format!
        Transformer t = TransformerFactory.newInstance().newTransformer();

        // For "XHTML" serialization, use the output method "xml"
        // and set the public and system IDs as shown
        t.setOutputProperty(OutputKeys.METHOD, "xml");

        t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
                            "-//W3C//DTD XHTML 1.0 Transitional//EN");

        t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
                       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");

        // For "HTML" serialization, use this instead
        // (it overrides the "xml" method set above)
        t.setOutputProperty(OutputKeys.METHOD, "html");

        // Serialize the DOM tree; try-with-resources closes the writer afterwards
        try (java.io.Writer writer = new java.io.FileWriter(path + "/code-coverage-total.html")) {
            t.transform(new DOMSource(template), new StreamResult(writer));
        }
