
Question

I have an XML document that I need to parse, shown below:

<feed>
    <feed_id>12941450184d2315fa63d6358242</feed_id>
    <content> <fieldset><table cellpadding='0'  border='0'  cellspacing='0'  style="clear :both"><tr valign='top' ><td width='35' ><a href='http://mypage.rediff.com/android/32868898'  class='space' onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" ><div style='width:25px;height:25px;overflow:hidden;'><img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb'  width='25'  vspace='0'  /></div></a></td> <td><span><a href='http://mypage.rediff.com/android/32868898'  class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" >Android </a> </span><span style='color:#000000 !important;'>testing</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/></content>
    <action>status updated</action>
</feed>

The <content> tag contains HTML, which holds the data I need. I am using the SAX parser. This is what I am doing:

private Timeline timeLine; //Object
private String tempStr;

public void characters(char[] ch, int start, int length)
        throws SAXException {
    tempStr = new String(ch, start, length);
}

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equalsIgnoreCase("content")) {
        if (timeLine != null) {
            timeLine.setContent(tempStr);
        }
    }
}

Will this logic work? If not, how should I extract the embedded HTML data from the XML using the SAX parser?

Answer 1

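You can parse the HTML as well; after all, HTML is also XML (provided it is well-formed). There is a similar question on Stack Overflow. You can try this one: How to parse the html content in android using SAX PARSER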

Answer 2

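In startElement, if the element is content, initialize the tempStr buffer. Otherwise, if content has already started, capture the current start element together with its attributes and append it to the tempStr buffer.

In characters, if content has started, append the characters to the current string buffer.

In endElement, if content has started, capture the end tag and append it to the string buffer.

My assumption:

The XML has only one content tag.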
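
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name ContentMarkupCollector and the helpers extract and escape are hypothetical, the input must be well-formed XML, and there is a single, non-nested <content> element. It uses qName because the default JAXP factory is not namespace-aware.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: rebuilds the markup found inside <content> as a string.
class ContentMarkupCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private boolean inContent = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (!inContent) {
            if (qName.equalsIgnoreCase("content")) {
                inContent = true;      // start collecting
                buffer.setLength(0);
            }
            return;
        }
        // Nested element: write its start tag and attributes back out.
        buffer.append('<').append(qName);
        for (int i = 0; i < attributes.getLength(); i++) {
            buffer.append(' ').append(attributes.getQName(i))
                  .append("=\"").append(escape(attributes.getValue(i))).append('"');
        }
        buffer.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so always append.
        if (inContent) {
            buffer.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!inContent) {
            return;
        }
        if (qName.equalsIgnoreCase("content")) {
            inContent = false;         // stop collecting
        } else {
            buffer.append("</").append(qName).append('>');
        }
    }

    // The parser hands over decoded text, so re-escape the special characters.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    // Assumed entry point for illustration: parses xml and returns the inner markup.
    static String extract(String xml) {
        ContentMarkupCollector handler = new ContentMarkupCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.buffer.toString();
    }
}
```

Calling ContentMarkupCollector.extract(xml) returns the rebuilt markup. Note that the result is a re-serialization, not a byte-for-byte copy: an empty element such as <br/> comes back as <br></br>, and attribute quotes are normalized to double quotes. The sample feed in the question also contains unescaped & characters inside attribute values, which a conforming XML parser would likely reject before any handler code runs.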

Answer 3

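If the HTML is actually XHTML, you can parse it with SAX and extract the XHTML content of the <content> tag, but it is not very easy.

You have to make your handler respond to the events raised by every XHTML tag inside the <content> tag, and either build something like a DOM structure that you can then serialize back to XML, or write directly, on the fly, into a string buffer that replicates the content as XML.

If you modify your XML so that the HTML inside the content tag is wrapped in a CDATA section, as suggested in How to parse the html content in android using SAX PARSER, then something not too far from your code should indeed work.

However, you cannot simply assign the content to your String tempStr variable in the characters method as you do now. You need a startElement method that initializes a buffer for the string when it sees the <content> tag, then collect into that buffer in the characters method, and put the result somewhere in endElement for the <content> tag.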
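
A minimal sketch of that buffer-based approach, assuming the feed has been changed so the HTML sits in a CDATA section. The class name ContentTextHandler and the parse helper are hypothetical; in the question's code, the point where the result is stored is where timeLine.setContent(...) would be called.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates the text of <content> across characters() calls.
class ContentTextHandler extends DefaultHandler {
    private StringBuilder buffer;   // non-null only while inside <content>
    private String content;         // result, set when </content> is reached

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equalsIgnoreCase("content")) {
            buffer = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may split one text node into several chunks: append, never assign.
        if (buffer != null) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("content") && buffer != null) {
            content = buffer.toString();   // e.g. timeLine.setContent(content) in the question
            buffer = null;
        }
    }

    // Assumed entry point for illustration: returns the text of <content>, or null if absent.
    static String parse(String xml) {
        ContentTextHandler handler = new ContentTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.content;
    }
}
```

With the CDATA wrapper the parser treats the HTML as plain character data, so the tags and raw & characters arrive untouched in characters and no startElement events fire for them.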

Answer 4

I found a solution this way:

Note: in this solution I want to get the HTML content between the <chapter> tags (<chapter> ... html content ... </chapter>).

DefaultHandler handler = new DefaultHandler() {

    // True right after <chapter> was seen, until the first characters() call.
    boolean chap = false;

    // Reference to the parser's character array, plus the chapter's offsets in it.
    public char[] temp;
    int chapterStart;
    int chapterEnd;

    public void startElement(String uri, String localName,
            String qName, Attributes attributes)
            throws SAXException {

        System.out.println("Start Element :" + qName);

        if (qName.equalsIgnoreCase("chapter")) {
            chap = true;
        }

    }

    public void endElement(String uri, String localName,
            String qName) throws SAXException {

        if (qName.equalsIgnoreCase("chapter")) {
            // Print everything between the recorded start and end offsets.
            System.out.println(new String(temp, chapterStart, chapterEnd - chapterStart));
        }
        System.out.println("End Element :" + qName);

    }

    public void characters(char ch[], int start, int length)
            throws SAXException {

        if (chap) {
            // First text after <chapter>: remember the array and where the chapter begins.
            temp = ch;
            chapterStart = start;
            chap = false;
        }
        // Keep moving the end offset forward with every chunk of text.
        chapterEnd = start + length;

    }

};

Update

My code has a bug: the length of ch[] in the DocumentHandler varies in different situations!

SAX Parser: retrieving HTML tags from XML
