Question

I am trying to parse a simple HTML fragment with NekoHTML:

<h1>This is a basic test</h1>

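To do this, I set a specific Neko feature so that the startElement(..) callback is not invoked for any HTML, HEAD or BODY tag.

Unfortunately, it does not work for me. I am surely missing something, but I cannot figure out what it is.

Here is some very simple code that reproduces my problem: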

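 // Needs: org.xml.sax.Attributes, org.xml.sax.SAXException, org.xml.sax.helpers.DefaultHandler
 // Extending DefaultHandler means only the callbacks of interest have to be overridden.
 public static class MyContentHandler extends DefaultHandler {

     // Prints each chunk of character data
     @Override
     public void characters(char[] ch, int start, int length) throws SAXException {
         String text = String.valueOf(ch, start, length);
         System.out.println(text);
     }

     // Prints the qualified name of every opening tag
     @Override
     public void startElement(String nameSpaceURI, String localName, String rawName, Attributes attributes) throws SAXException {
         System.out.println(rawName);
     }

     // Prints "end " plus the local name of every closing tag
     @Override
     public void endElement(String nameSpaceURI, String localName, String rawName) throws SAXException {
         System.out.println("end " + localName);
     }
 }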

The main() that launches the test:

  // Needs: java.io.IOException, java.io.StringReader, org.cyberneko.html.parsers.SAXParser,
  //        org.xml.sax.InputSource, org.xml.sax.SAXException
  public static void main(String[] args) throws SAXException, IOException {
       SAXParser saxReader = new SAXParser();
       // Set the feature as explained in the documentation: http://nekohtml.sourceforge.net/faq.html#fragments
       saxReader.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
       saxReader.setContentHandler(new MyContentHandler());
       // The JDK's StringReader stands in for the non-JDK StringInputStream helper
       saxReader.parse(new InputSource(new StringReader("<h1>This is a basic test</h1>")));
  }

The corresponding output:

HTML
HEAD
end HEAD
BODY
H1
This is a basic test
end H1
end BODY
end HTML

whereas I was expecting:

H1
This is a basic test
end H1

Any ideas?

Answer 1

I finally figured it out!

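I was actually parsing my HTML string inside a GWT application, where I had added the gwt-dev.jar dependency. This jar bundles many external libraries, such as xercesImpl. But the version of the embedded Xerces classes did not match the one required by NekoHTML.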

As a (strange) result, it seems that the NekoHTML SAX parser does not apply any custom feature when it runs against the Xerces version embedded in gwt-dev.

So I had to rework some code to remove the gwt-dev dependency which, by the way, is not recommended for inclusion in any standard GWT project.
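To check for this kind of classpath conflict, it can help to print which jar each parser class is actually loaded from. The following is only a minimal sketch: the `WhichJar` class and its helper methods are hypothetical names introduced here, and the two class names passed in `main` are assumed to be the relevant Xerces and NekoHTML entry points for this setup.

```java
import java.security.CodeSource;

public class WhichJar {

    // Returns the location (typically a jar URL) a class was loaded from,
    // or a fallback string for classes without a code source (e.g. JDK classes).
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    // Looks a class up by name without initializing it and reports its origin.
    static String describe(String className) {
        try {
            Class<?> type = Class.forName(className, false, WhichJar.class.getClassLoader());
            return className + " <- " + locationOf(type);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " <- not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Assumed class names: the Xerces base parser and the NekoHTML SAX parser.
        System.out.println(describe("org.apache.xerces.parsers.AbstractSAXParser"));
        System.out.println(describe("org.cyberneko.html.parsers.SAXParser"));
    }
}
```

If the Xerces line points at gwt-dev.jar rather than at a standalone xercesImpl jar, the embedded copy is probably the one shadowing the version NekoHTML expects.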

NekoHTML SAX fragment parsing
