KXmlParser throws "Unexpected token" exception at the start of RSS parsing

Time: 2013-03-06 17:31:25

Tags: android rss xmlpullparser

I'm trying to parse Monster's RSS feed on Android v.17 using the following URL:

http://rss.jobsearch.monster.com/rssquery.ashx?q=java

I'm fetching the content with HttpURLConnection in the following manner:

this.conn = (HttpURLConnection) url.openConnection();
this.conn.setConnectTimeout(5000);
this.conn.setReadTimeout(10000);
this.conn.setUseCaches(true);
conn.addRequestProperty("Content-Type", "text/xml; charset=utf-8");
is = new InputStreamReader(url.openStream());

As far as I can tell (and I have verified it), the response is a valid RSS feed:

Cache-Control:private
Connection:Keep-Alive
Content-Encoding:gzip
Content-Length:5958
Content-Type:text/xml
Date:Wed, 06 Mar 2013 17:15:20 GMT
P3P:CP=CAO DSP COR CURa ADMa DEVa IVAo IVDo CONo HISa TELo PSAo PSDo DELa PUBi BUS LEG PHY ONL UNI PUR COM NAV INT DEM CNT STA HEA PRE GOV OTC
Server:Microsoft-IIS/7.5
Vary:Accept-Encoding
X-AspNet-Version:2.0.50727
X-Powered-By:ASP.NET

It starts like this (if you want to see the complete XML, click the URL above):

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>Monster Job Search Results java</title>
    <description>RSS Feed for Monster Job Search</description>
    <link>http://rss.jobsearch.monster.com/rssquery.ashx?q=java</link>

But when I try to parse it:

final XmlPullParser xpp = getPullParser();
xpp.setInput(is);
for (int type = xpp.getEventType(); type != XmlPullParser.END_DOCUMENT; type = xpp.next()) { /* parsing goes here */ }

The code chokes immediately on type = xpp.next() with the following exception:

03-06 09:27:27.796: E/AbsXmlResultParser(13363): org.xmlpull.v1.XmlPullParserException: 
   Unexpected token (position:TEXT @1:2 in java.io.InputStreamReader@414b4538) 

This effectively means it cannot handle the second character of line 1, <?xml version="1.0" encoding="utf-8"?>

Below are the offending lines in KXmlParser.java (425-426). The check type == TEXT evaluates to true:

if (depth == 0 && (type == ENTITY_REF || type == TEXT || type == CDSECT)) {
    throw new XmlPullParserException("Unexpected token", this, null);
}

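Any help? I did try setting the parser's XmlPullParser.FEATURE_PROCESS_DOCDECL feature to false, but that did not help.

I researched this on the web and here and could not find anything that helps.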

1 Answer:

Answer 0 (score: 34)

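The reason you get the error is that the XML file does not actually start with <?xml version="1.0" encoding="utf-8"?>. It starts with three special bytes, EF BB BF, which are the UTF-8 byte order mark (BOM).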
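
One way to confirm this is to dump the first few bytes of the response body as hex. The following is only a minimal sketch: BomCheck and firstBytesHex are hypothetical names, and the sample byte array stands in for the real response stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class BomCheck {
    /** Returns up to the first n bytes of the stream as space-separated uppercase hex. */
    static String firstBytesHex(InputStream in, int n) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            int b = in.read();          // -1 signals end of stream
            if (b < 0) break;
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the feed: a UTF-8 BOM followed by the XML declaration prefix.
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        System.out.println(firstBytesHex(new ByteArrayInputStream(sample), 3));
    }
}
```

If the first three bytes print as EF BB BF, the stream begins with a UTF-8 BOM rather than with the `<` of the XML declaration.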


InputStreamReader does not handle these bytes automatically, so you have to handle them yourself. The simplest way is to use BOMInputStream from the Commons IO library:

this.conn = (HttpURLConnection) url.openConnection();
this.conn.setConnectTimeout(5000);
this.conn.setReadTimeout(10000);
this.conn.setUseCaches(true);
conn.addRequestProperty("Content-Type", "text/xml; charset=utf-8");
is = new InputStreamReader(new BOMInputStream(conn.getInputStream(), false, ByteOrderMark.UTF_8));  

I have checked the code above, and it works fine for me.
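
If pulling in Commons IO is not an option, the same idea can be sketched by hand with java.io.PushbackInputStream. This is only an illustration under assumptions: Utf8BomSkipper and openSkippingBom are hypothetical names, the feed is assumed to be UTF-8, and only the UTF-8 BOM is handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

class Utf8BomSkipper {
    /** Wraps raw in a UTF-8 Reader, dropping a leading EF BB BF if present. */
    static Reader openSkippingBom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r < 0) break;
            n += r;
        }
        boolean hasBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!hasBom && n > 0) {
            in.unread(head, 0, n);      // not a BOM: put the bytes back
        }
        // The charset-name overload is used so the sketch does not depend on newer APIs.
        return new InputStreamReader(in, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'r', 's', 's', '>'};
        Reader reader = openSkippingBom(new ByteArrayInputStream(sample));
        System.out.println((char) reader.read());
    }
}
```

In the question's code, the returned Reader would take the place of the InputStreamReader passed to xpp.setInput(is), with conn.getInputStream() as the raw stream.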