Question

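The documentation of the Document interface describes the interface as:

The Document interface represents the entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I am unable to find a way to build a Document that is an HTML Document!

I want an HTML Document because I am trying to build a document that I then pass to a library that expects an HTML Document. This library uses Document#getElementsByTagName(String tagname) in a case-insensitive manner, which is fine for HTML, but not for XML.

I've looked around and am unable to find anything. Questions like "How to convert an Html source of a webpage into org.w3c.dom.Document in java?" don't actually have an answer.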

Answer 1

You seem to have two explicit requirements:

You need to represent HTML as an org.w3c.dom.Document.
You need Document#getElementsByTagName(String tagname) to work in a case-insensitive manner.
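
To make the second requirement concrete, here is a minimal sketch that uses only the JDK's default XML parser. The class and method names (CaseSensitivityDemo, parse, count) are made up for illustration; the point is only that a plain XML DOM matches tag names case-sensitively.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class CaseSensitivityDemo {

    // Parses a small well-formed document with the JDK's default XML parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    // Counts the elements the DOM returns for the given tag name.
    static int count(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>one</p><p>two</p></body></html>");
        System.out.println("p: " + count(doc, "p")); // prints "p: 2"
        System.out.println("P: " + count(doc, "P")); // prints "P: 0"
    }
}
```

A library that searches for "P" or "BODY" would therefore find nothing in a lowercase XML document unless the lookup is normalized somewhere.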

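If you are trying to work with HTML using org.w3c.dom.Document, then I assume you are working with some flavor of XHTML, because an XML API such as DOM expects well-formed XML. HTML isn't necessarily well-formed XML, but XHTML is. Even if you were working with HTML, you would have to do some pre-processing to ensure it is well-formed XML before trying to run it through an XML parser. It may be easier to parse the HTML first with an HTML parser such as jsoup, then build your org.w3c.dom.Document by walking the tree the HTML parser produced (an org.jsoup.nodes.Document in the case of jsoup).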
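
As a rough illustration of the well-formedness point, the sketch below feeds the same fragment to the JDK's XML parser twice: once as typical HTML with an unclosed br tag, and once as XHTML. The class name WellFormednessDemo and the helper parsesAsXml are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class WellFormednessDemo {

    // Returns true if the JDK's default XML parser accepts the markup.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // the markup is not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsesAsXml("<p>Hello<br></p>"));  // prints "false"
        System.out.println(parsesAsXml("<p>Hello<br/></p>")); // prints "true"
    }
}
```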

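There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found was in Xerces-j (2.11.0), in the form of org.apache.html.dom.HTMLDocumentImpl. At first this seems promising, but upon closer examination there are some issues.

1. There is no clear, "clean" way to obtain an instance of an object that implements the org.w3c.dom.html.HTMLDocument interface.

With Xerces, we would normally obtain a Document object using a DocumentBuilder in the following way: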

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file

Or using the DOMImplementation variety:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");

In both cases, we are purely using org.w3c.dom.* interfaces to get a Document object.

The closest equivalent I found for HTMLDocument was something like this:

HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

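This requires us to directly instantiate internal implementation classes, making our implementation dependent on Xerces.

(Note: I also saw that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can generate an HTMLDocument using a SAX parser, but I didn't bother looking into it.)

2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.

Although you can search the HTMLDocument tree in a case-insensitive manner using getElementsByTagName(String tagname), all element names are saved internally in ALL CAPS. But XHTML element and attribute names are supposed to be in all lowercase. (This could be worked around by walking the entire document tree and using Document's renameNode() method to change all of the elements' names to lowercase.)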
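
The renaming workaround mentioned in the parenthetical can be sketched as follows. This is a minimal sketch against a plain Document built with the JDK parser, not against HTMLDocumentImpl, and the names LowercaseRenamer, lowercaseElementNames and renamedSample are hypothetical. It assumes the DOM implementation supports renameNode (DOM Level 3).

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of any library.
class LowercaseRenamer {

    // Renames every element at or below 'node' to the lowercase form of its name.
    static void lowercaseElementNames(Document doc, Node node) {
        Node current = node;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String lower = node.getNodeName().toLowerCase(Locale.ROOT);
            if (!lower.equals(node.getNodeName())) {
                // renameNode may return a replacement node, so continue from its result.
                current = doc.renameNode(node, node.getNamespaceURI(), lower);
            }
        }
        Node child = current.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before the child is possibly replaced
            lowercaseElementNames(doc, child);
            child = next;
        }
    }

    // Builds a tiny all-caps document and returns it after renaming.
    static Document renamedSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            doc.appendChild(html);
            Element body = doc.createElement("BODY");
            html.appendChild(body);
            Element p = doc.createElement("P");
            p.setTextContent("text");
            body.appendChild(p);
            lowercaseElementNames(doc, doc);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DocumentBuilder available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = renamedSample();
        System.out.println(doc.getDocumentElement().getNodeName()); // prints "html"
    }
}
```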

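Additionally, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There doesn't seem to be a straightforward way to set those in an HTMLDocument (unless you do some fiddling with the internal Xerces implementations).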
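
For comparison, a plain org.w3c.dom.Document can be given both declarations through the standard DOMImplementation interface. The sketch below shows that route only; it does not apply to HTMLDocument, and the class name XhtmlSkeleton and method newXhtmlDocument are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical helper class, not part of any library.
class XhtmlSkeleton {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Creates an empty document with an XHTML 1.0 Strict DOCTYPE and a namespaced root.
    static Document newXhtmlDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DOMImplementation impl = factory.newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = impl.createDocumentType("html",
                    "-//W3C//DTD XHTML 1.0 Strict//EN",
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
            // The root element is created in the XHTML namespace and the doctype is attached.
            return impl.createDocument(XHTML_NS, "html", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DocumentBuilder available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = newXhtmlDocument();
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```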

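3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface appears to be incomplete.

I didn't scour the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs and the comments in the source code of the internal Xerces implementation. In those comments, I also found notes saying that several different parts of the interface were not implemented. (Side note: I get the impression that the org.w3c.dom.html.HTMLDocument interface itself is not really used by anyone and is perhaps incomplete itself.)

For these reasons, I think it's better to avoid org.w3c.dom.html.HTMLDocument and just do what we can with org.w3c.dom.Document. What can we do?

One approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach does not require much code, but it still makes our implementation dependent on Xerces, since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we simply convert all tag names to lowercase on element creation and searching. This allows Document#getElementsByTagName(String tagname) to be used in a case-insensitive manner.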

MyHTMLDocumentImpl:

import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;


    /**
     * Creates a Document with basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html 
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *     <head>
     *         <title>My Title</title>
     *     </head>
     *     <body/>
     * </html>
     * }
     * </pre>
     * 
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document. 
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);

        return htmlDoc;
    }

    /**
     * This method will allow us to create our
     * MyHTMLDocumentImpl from an existing Document.
     */
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    // Locale.ROOT keeps the lowercasing independent of the JVM's default locale
    // (for example, the Turkish locale maps "I" to a dotless i).
    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }
}

Tester:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;


public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    public static void main(String[] args) //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());

        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true);  //if you want it pretty and indented

        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc)); 
    }

}

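Output:

My Title

Here is some text1.
Here is some text2.
Here is some text3.

Here is some text1.
Here is some text2.
Here is some text3.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>

Another approach, similar to the above, is to instead create a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extending DocumentImpl" approach, but it is "cleaner" because we don't have to care about a particular Document implementation. The extra code for this approach isn't difficult; it's just a bit tedious to provide all those wrapper implementations for the Document methods. I haven't completely worked this out yet and there may be some problems, but if it works, this is the general idea: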

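public class MyHTMLDocumentWrapper implements Document {

    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }

    //...
}

Whether it's org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, maybe these suggestions will help give you an idea of how to proceed.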
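
If writing every delegate method by hand is too tedious, one alternative way to try out the wrapper idea is java.lang.reflect.Proxy. This swaps the hand-written wrapper class for a dynamic proxy, so treat it as a sketch under assumptions: the class name LowercasingDocumentProxy and the helper countVia are hypothetical, only getElementsByTagName is normalized, and the wrapped document is assumed to already use lowercase element names. The proxy is also not a node of the underlying implementation, so passing the proxy itself to methods such as appendChild on real nodes is likely to fail.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of any library.
class LowercasingDocumentProxy {

    // Returns a Document view that lowercases the argument of
    // getElementsByTagName and delegates every other call unchanged.
    static Document wrap(Document doc) {
        return (Document) Proxy.newProxyInstance(
                LowercasingDocumentProxy.class.getClassLoader(),
                new Class<?>[] { Document.class },
                (proxy, method, args) -> {
                    if (method.getName().equals("getElementsByTagName")
                            && args != null && args.length == 1) {
                        args[0] = ((String) args[0]).toLowerCase(Locale.ROOT);
                    }
                    try {
                        return method.invoke(doc, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // surface the original exception, e.g. DOMException
                    }
                });
    }

    // Builds a small lowercase document and counts matches through the proxy.
    static int countVia(String query) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            doc.appendChild(html);
            html.appendChild(doc.createElement("body"));
            return wrap(doc).getElementsByTagName(query).getLength();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DocumentBuilder available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countVia("BODY")); // prints "1"
        System.out.println(countVia("body")); // prints "1"
    }
}
```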

Edit

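In my parsing tests, while trying to parse an XHTML file, Xerces would hang in the entity management class trying to open an HTTP connection. Why, I don't know, especially since I tested on a local HTML file with no entities. (Maybe it has something to do with the DOCTYPE or the namespace?)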
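
One plausible cause, which the test above does not confirm, is the parser fetching the external DTD named in the DOCTYPE. Under that assumption, a minimal sketch is to register an EntityResolver that answers every external entity request with empty content, so no HTTP connection is opened. The class name OfflineXhtmlParser, the method parseWithoutFetchingDtd and the SAMPLE document are hypothetical; note that with an empty DTD, named entities such as &nbsp; would no longer resolve.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library.
class OfflineXhtmlParser {

    // Made-up sample input with an XHTML 1.0 Strict DOCTYPE.
    static final String SAMPLE =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>My Title</title></head>"
            + "<body><p>Here is some text.</p></body></html>";

    // Parses XHTML text without letting the parser download the DTD.
    static Document parseWithoutFetchingDtd(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Every external entity, including the DTD, resolves to empty content.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWithoutFetchingDtd(SAMPLE);
        System.out.println(doc.getDocumentElement().getLocalName()); // prints "html"
    }
}
```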

How to build an HTML org.w3c.dom.Document?
