Question

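I'm using Dom4j to parse HTML documents. Dom4j expects XML, so the HTML entities are not declared. They could be declared in the document's DTD, but I'm parsing external input, so that isn't appropriate. I'd rather declare them programmatically in the parser.

Here is my code: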

    // Read.
    final DocumentFactory df = DOMDocumentFactory.getInstance();
    SAXReader reader = new SAXReader();
    Document doc, outDoc;
    try {
        doc = reader.read( new StringReader(htmlStr) );
    }
    catch( Exception ex ){
        throw new RuntimeException("Error parsing the HTML:\n       " + ex.toString() );
    }

I see that SAXReader has `reader.setEntityResolver( ??? );`, but that doesn't seem to be the solution, because the method to override looks like this:

    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException

What I'm looking for is something like this:

    reader.setTrueEntityResolver( new EntityResolver(){
        public InputStream resolve( String name ){ ... }
    });

Answer 1

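At http://evc-cit.info/dom4j/dom4j_groovy.html I found a possible solution, which suggests adding the XML Commons Catalog resolver.

However, this seems like overkill, since no doctype is specified anyway and I only intend to resolve the common HTML 4 entities.

Update: It turns out that without an explicit DOCTYPE declaration this has no effect: the EntityResolver is never called.

Maven dependency: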

    <dependency>
        <groupId>xml-resolver</groupId>
        <artifactId>xml-resolver</artifactId>
        <version>1.2</version>
        <scope>test</scope>
    </dependency>

Configuration in /CatalogManager.properties on the classpath:

    # allow location to be relative to this file's directory
    relative-catalogs=yes

    # A semicolon-delimited list of catalog files.
    # In this instance, we have a single catalog file, and it's a relative path name
    catalogs=sgml-lib/xml.soc

    # no debugging messages, please
    verbosity=0

    # Use the SYSTEM identifier
    prefer=system

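Tell the parser to use the catalog resolver when it encounters a DTD (Groovy, as on the linked page):

    // CatalogManager and CatalogResolver come from org.apache.xml.resolver
    cMgr = new CatalogManager()   // reads CatalogManager.properties from the classpath
    cResolver = new CatalogResolver( cMgr )
    reader = new SAXReader( )
    reader.setEntityResolver( cResolver )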
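
To illustrate the update above: a resolver is only consulted once the input carries a DOCTYPE, so one option is to prepend one programmatically. The following is a minimal sketch using the JDK's built-in parser rather than Dom4j. The class name, the `urn:example:html-entities` system ID, and the three-entity list are made up for illustration; a real list would need the full HTML 4 set. The same resolver and DOCTYPE-prefixed string should also be usable with `reader.setEntityResolver(...)` and `reader.read( new StringReader(...) )`, though that is an assumption I have not tested here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class HtmlEntityDoctypeSketch {
    // Hypothetical system ID: it is never fetched, only matched by the resolver below.
    static final String SYSTEM_ID = "urn:example:html-entities";

    // Illustrative subset of the HTML 4 entities, declared as an external DTD subset.
    static final String ENTITY_DECLS =
        "<!ENTITY nbsp \"&#160;\">\n" +
        "<!ENTITY copy \"&#169;\">\n" +
        "<!ENTITY eacute \"&#233;\">\n";

    // Prepends a DOCTYPE so the parser asks the resolver for the DTD,
    // then serves the entity declarations from memory.
    static Document parse(String markup, String rootName) throws Exception {
        String withDoctype =
            "<!DOCTYPE " + rootName + " SYSTEM \"" + SYSTEM_ID + "\">\n" + markup;

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver((publicId, systemId) -> {
            if (systemId != null && systemId.endsWith("html-entities")) {
                return new InputSource(new StringReader(ENTITY_DECLS));
            }
            return null; // fall back to the parser's default behaviour
        });
        return builder.parse(new InputSource(new StringReader(withDoctype)));
    }
}
```

Calling `HtmlEntityDoctypeSketch.parse("<p>caf&eacute;</p>", "p")` should then parse without the "referenced, but not declared" error. Note that this only works if the input has no DOCTYPE or XML declaration of its own, and the root element name has to be known in advance.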

Answer 2

Well, as you said, DOM4J isn't meant to parse HTML. I'd rather use something like TagSoup or HTML Cleaner. The problem isn't the entities; HTML is not XML.
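
To make that point concrete, here is a small sketch (JDK parser only; the class and method names are hypothetical) that checks whether a string is well-formed XML. Ordinary HTML constructs such as an unclosed `<br>` or a valueless attribute fail even when no entities are involved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class HtmlIsNotXmlSketch {
    // Returns true if the markup parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }
}
```

With this, `isWellFormedXml("<p>one<br/>two</p>")` is accepted, while `isWellFormedXml("<p>one<br>two</p>")` and `isWellFormedXml("<input disabled>")` are rejected, which is why declaring the entities alone would not be enough for real-world HTML.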

Dom4j parsing - how to declare HTML entities programmatically? "The entity was referenced, but not declared."
