ContentHandler makes its callbacks in the wrong order

Time: 2014-02-13 11:28:46

Tags: android html css saxparser

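I'm using a ContentHandler to parse custom HTML with CSS styles. The problem is that when I parse HTML containing a UL tag, the ContentHandler callbacks arrive in the wrong order: startElement() is called, then endElement(), and only then characters().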

Here is my HTML

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>ul.ul1{list-style-type:image;}
</style>
</head>
<body>
<ul class="ul1">List</ul>
<ul class="ul2">List</ul>
</body>
</html>

Here is the sample code I use to test the parser

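import android.text.Spanned;
import android.util.Log;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

public class ContentHandler implements org.xml.sax.ContentHandler {
    public ContentHandler() {
    }

    public Spanned getResult() {
        // Placeholder: this test handler only logs events and builds no result yet.
        return null;
    }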

    @Override
    public void setDocumentLocator(Locator locator) {
    }

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        Log.d("html_parser", "start " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        Log.d("html_parser", "end " + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String bodyText = new String(ch, start, length);
        Log.d("html_parser", bodyText);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
    }

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
    }
}

And the LogCat output

02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start html
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start head
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start meta
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end meta
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start style
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ ul.ul1{list-style-type:image;}
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end style
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end head
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start body
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end body
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end html

Note that when I parse HTML without a UL tag, everything works correctly. Also note that I use org.ccil.cowan.tagsoup.jaxp.SAXParserImpl for parsing.

1 answer:

Answer 0 (score: 2)

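I tested your problem and found some interesting facts. You are using a SAX parser to parse HTML, and HTML differs from XML in many ways. For example, tags can sometimes be left unclosed. That is why org.ccil.cowan.tagsoup.jaxp.SAXParserImpl is used: it lets us parse HTML. This parser also supplies some additional tags on its own (see https://github.com/websdotcom/tagsoup#what-tagsoup-does). Look at the HTML in the code below: if you give the content a proper structure, it is handled correctly. So I think this looks like a bug in the TagSoup lib.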

import android.test.AndroidTestCase;
import android.util.Log;

import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import javax.xml.parsers.SAXParser;

/**
 * Created by kulik on 1/5/14.
 */
public class SaxTest extends AndroidTestCase {
    private static final String TAG = "SaxTest";

    public void testSax() {
        String testString = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
               "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
                "<style>ul.ul1{list-style-type:image;}\n" +
                "</style>\n" +
                "</head>\n" +
                "<body>\n" +
                "<ul class=\"ul1\">List</ul>\n" +
                "<ul class=\"ul2\">" +
                "<li> li1</li>\n" +
                "<li> li2</li>\n" +
                "</ul>" +
                "</body>\n" +
                "</html>";

        Reader reader = new StringReader(testString);
        try {
            SAXParser sp = SAXParserImpl.newInstance(null);
            XMLReader xr = sp.getXMLReader();

            DefaultHandler myHandler = new ContentHandler();
            xr.setContentHandler(myHandler);
            xr.parse(new InputSource(reader));
        } catch (SAXException e) {
            Log.e(TAG, "", e);
        } catch (IOException e) {
            Log.e(TAG, "", e);
        }
    }

    public class ContentHandler extends DefaultHandler  {

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
            Log.d("html_parser", "start " + localName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            Log.d("html_parser", "end " + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            String bodyText = new String(ch, start, length);
            Log.d("html_parser", bodyText);
        }
    }
}

And the log

  D/html_parser﹕ start html
  D/html_parser﹕ start head
  D/html_parser﹕ start meta
  D/html_parser﹕ end meta
  D/html_parser﹕ start style
  D/html_parser﹕ ul.ul1{list-style-type:image;}
  D/html_parser﹕ end style
  D/html_parser﹕ end head
  D/html_parser﹕ start body
  D/html_parser﹕ start ul
  D/html_parser﹕ end ul
  D/html_parser﹕ List
  D/html_parser﹕ start ul
  D/html_parser﹕ start li
  D/html_parser﹕ li1
  D/html_parser﹕ end li
  D/html_parser﹕ start li
  D/html_parser﹕ li2
  D/html_parser﹕ end li
  D/html_parser﹕ end ul
  D/html_parser﹕ end body
  D/html_parser﹕ end html

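So you can implement your handler to catch this case, because I think it only occurs for a <ul> tag that has no <li> inside it. Maybe it happens because:

    The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted.......

    http://home.ccil.org/~cowan/XML/tagsoup/

You can also ask the TagSoup team.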
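
For illustration, here is a minimal sketch of a handler that catches this case. It is only one possible approach, not something TagSoup provides: the class name UlTextRecoveringHandler and the getEvents() method are made up for this example. The idea is to hold back the "end ul" event when a ul closes immediately after opening, and to reattach the next non-blank text to that ul. The demo feeds the events by hand, in the same order as the log above, so it runs on a plain JVM without Android or TagSoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records SAX events and moves text that arrives right
// after an empty <ul> back inside that <ul>.
class UlTextRecoveringHandler extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private boolean justOpenedUl = false;  // previous event was startElement("ul")
    private boolean pendingUlEnd = false;  // end of an empty ul is being held back

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushPendingUlEnd();
        events.add("start " + localName);
        justOpenedUl = "ul".equals(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushPendingUlEnd();
        if ("ul".equals(localName) && justOpenedUl) {
            // The ul closed with nothing inside: wait to see whether text follows.
            pendingUlEnd = true;
        } else {
            events.add("end " + localName);
        }
        justOpenedUl = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length);
        if (pendingUlEnd && !text.trim().isEmpty()) {
            // Assumption: non-blank text right after an empty ul belongs to it.
            // Only the first characters() chunk is reattached in this sketch.
            events.add(text);
            events.add("end ul");
            pendingUlEnd = false;
        } else {
            flushPendingUlEnd();
            events.add(text);
        }
        justOpenedUl = false;
    }

    @Override
    public void endDocument() {
        flushPendingUlEnd();
    }

    private void flushPendingUlEnd() {
        if (pendingUlEnd) {
            events.add("end ul");
            pendingUlEnd = false;
        }
    }

    List<String> getEvents() {
        return events;
    }

    public static void main(String[] args) {
        UlTextRecoveringHandler handler = new UlTextRecoveringHandler();
        char[] text = "List".toCharArray();
        // Same order as in the log: start ul, end ul, then the text.
        handler.startElement("", "ul", "ul", new AttributesImpl());
        handler.endElement("", "ul", "ul");
        handler.characters(text, 0, text.length);
        handler.endDocument();
        System.out.println(handler.getEvents()); // prints [start ul, List, end ul]
    }
}
```

A ul that does contain li children is left alone, because its end never directly follows its start. In a real handler you would do your own processing where this sketch adds strings to the list.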