Android, parsing XML: how to ignore HTML tags?

Asked: 2012-03-12 09:59:06

Tags: android xml parsing saxparser

In my project I need to parse XML. Some items in the XML contain HTML tags. I tried to remove those tags but did not succeed. The code in the activity is:

private NewsFeedItemList parseNewsContent() {
        NewsParserHandler newsParserHandler = null;

        Log.i("NewsList", "Starting to parse XML...");

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            XMLReader xr = parser.getXMLReader();
            newsParserHandler = new NewsParserHandler();
            xr.setContentHandler(newsParserHandler);

            ByteArrayInputStream is = new ByteArrayInputStream(strServerResponseMsg.getBytes());
            xr.parse(new InputSource(is));

        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        NewsFeedItemList itemList = newsParserHandler.getNewsList();
//      checkLog(itemList);

        Log.i("NewsList", "Parsing XML finished. Sending result back to caller...");
        return itemList;
    }

"strServerResponseMsg" contains the XML (from http://www.mania.com.my/rss/ManiaTopStoriesFeedFull.aspx?catid=146).

All items get parsed, but the ones that contain HTML tags are not parsed completely.

Here is my parser handler:

public class NewsParserHandler extends DefaultHandler {

    private NewsFeedItemList newsFeedItemList;  
    private boolean current = false;  
    private String currentValue = null;

   /* The feed root also has its own "title", "link" and "pubDate" elements,
    * which must not be stored as item values. We skip them by counting
    * the elements that have been closed so far. */
    private int count = 0; 


    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        super.characters(ch, start, length);

        if(current)  {  
            currentValue = new String(ch, start, length); 

            // Compare string contents with equals(); == only compares references.
            if(currentValue.equals("") || currentValue.equals(" "))
                currentValue = "-";

            current = false;  
        }
    }

    @Override
    public void startDocument() throws SAXException {
        super.startDocument();

        newsFeedItemList = new NewsFeedItemList();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        super.startElement(uri, localName, qName, attributes);

        current = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, localName, qName);

        current = false;

        if(localName.equals("title"))  {  
            if(count >= 1)
                newsFeedItemList.setTitle(currentValue);  
        }
        if(localName.equals("description"))  {  
            newsFeedItemList.setDescription(currentValue);  
        } 
        if(localName.equals("fullbody"))  {  
            newsFeedItemList.setFullbody(currentValue);  
        } 
        if(localName.equals("link"))  {  
            if(count >= 4)
                newsFeedItemList.setLink(currentValue);  
        } 
        if(localName.equals("pubDate"))  {  
            if(count >= 5)
                newsFeedItemList.setPubDate(currentValue);  
        } 
        if(localName.equals("image"))  {  
            newsFeedItemList.setImage(currentValue);  
        } 

        count++;
    }

    @Override
    public void endDocument() throws SAXException {
        super.endDocument();
    }   


    public NewsFeedItemList getNewsList() {
        return newsFeedItemList;
    }

}

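I tried putting currentValue = Html.fromHtml(currentValue).toString(); in the characters() method, but it had no effect. I also tried converting "strServerResponseMsg" to HTML before passing it to the parser, but then the parser did not parse anything.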

I found these threads, but their solutions did not work for me: "How to strip or escape html tags in Android" and "Display HTML Formatted String".

I would really appreciate any help. Thanks.

1 answer:

Answer 0 (score: 0)

Use the following method to remove all HTML tags from the currentValue variable.

public static String removeHtmlTag(String htmlString) {
        return htmlString.replaceAll("\\<.*?\\>", "").trim();
}
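
Stripping tags only helps if currentValue holds the whole text of the element. A SAX parser is allowed to deliver an element's text in several characters() calls, and it often splits at escaped entities such as &lt;, so a handler that keeps only one chunk may end up with truncated values for items containing HTML. The following is a minimal sketch, not the asker's actual fix: it accumulates all chunks in a StringBuilder and strips the tags in endElement(). The class name AccumulatingHandlerSketch, the DescriptionHandler class, the parseDescriptions helper and the sample XML are all made up for illustration. It also uses a slightly different pattern, <[^>]*>, so that tags spanning a line break are matched as well.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: names and sample data are assumptions, not from the question.
public class AccumulatingHandlerSketch {

    /** Collects the text of every <description> element with HTML tags removed. */
    static class DescriptionHandler extends DefaultHandler {
        final List<String> descriptions = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // Start a fresh buffer for each element (sufficient for leaf elements).
            buffer.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append instead of overwrite.
            buffer.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Depending on namespace settings, either localName or qName carries the name.
            String name = (localName == null || localName.isEmpty()) ? qName : localName;
            if ("description".equals(name)) {
                descriptions.add(removeHtmlTag(buffer.toString()));
            }
        }
    }

    /** Variant of the answer's method; [^>]* also matches across line breaks. */
    static String removeHtmlTag(String htmlString) {
        return htmlString.replaceAll("<[^>]*>", "").trim();
    }

    static List<String> parseDescriptions(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DescriptionHandler handler = new DescriptionHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.descriptions;
    }

    public static void main(String[] args) throws Exception {
        // The HTML inside <description> is entity-escaped, as is common in RSS feeds.
        String xml = "<rss><item><description>"
                + "&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;"
                + "</description></item></rss>";
        System.out.println(parseDescriptions(xml)); // prints [Hello world]
    }
}
```

In the handler from the question, the same idea would mean appending in characters() instead of assigning currentValue and clearing the flag after the first chunk, then calling removeHtmlTag (or Html.fromHtml) once in endElement().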