I am using Jsoup and parsing the following XML with XmlPullParser:
<item>
<title>(??????) [????]0 BLACK LAGOON -???? &middot; ????- ?01-09?</title>
<guid isPermaLink='true'>http://fenopy.eu/torrent/+black+lagoon+A+01+09+/OTcyOTA3Mw</guid>
<pubDate>Wed, 27 Feb 2013 11:00:04 GMT</pubDate>
<category>Anime</category>
<link>http://fenopy.eu/torrent/+black+lagoon+A+01+09+/OTcyOTA3Mw</link>
<enclosure url="http://fenopy.eu/torrent/-BLACK-LAGOON-01-09-/OTcyOTA3Mw==/download.torrent" length="569296173" type="application/x-bittorrent" />
<description><![CDATA[ Category: Anime<br/>Size: 542.9 MB<br/>Ratio: 0 seeds, 3 leechers<br/> ]]></description>
</item>
This is my parsing code:
int eventType = -1;
while (eventType != XmlPullParser.END_DOCUMENT) {
    switch (eventType) {
        // at start of document: START_DOCUMENT
        case XmlPullParser.START_DOCUMENT:
            break;
        // at start of a tag: START_TAG
        case XmlPullParser.START_TAG:
            // get tag name
            String tagName = parser.getName();
            if (tagName.equalsIgnoreCase(TAG_TITLE)) {
                // the exception below is thrown by this call
                String t = parser.nextText();
            }
            break;
    }
    // advance to the next parsing event
    eventType = parser.next();
}
When I call nextText(), it throws an exception:
org.xmlpull.v1.XmlPullParserException: unresolved: &middot; (position:TEXT (??????) [????] ...@36:59 in java.io.StringReader@40540698)
at org.kxml2.io.KXmlParser.exception(KXmlParser.java:273)
at org.kxml2.io.KXmlParser.error(KXmlParser.java:269)
at org.kxml2.io.KXmlParser.pushEntity(KXmlParser.java:818)
at org.kxml2.io.KXmlParser.pushText(KXmlParser.java:849)
at org.kxml2.io.KXmlParser.nextImpl(KXmlParser.java:354)
at org.kxml2.io.KXmlParser.next(KXmlParser.java:1378)
at org.kxml2.io.KXmlParser.nextText(KXmlParser.java:1432)
Answer 0 (score: 6)
I was dealing with the same problem and found a super simple solution:
xmlPullParser.setFeature(Xml.FEATURE_RELAXED, true);
Answer 1 (score: 1)
Your XML is invalid. &middot; is not a valid entity reference in XML.
There are 5 predefined entity references in XML:
&lt;    <    less than
&gt;    >    greater than
&amp;   &    ampersand
&apos;  '    apostrophe
&quot;  "    quotation mark
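To make the distinction concrete, here is a minimal sketch that checks well-formedness with the JDK's built-in DOM parser rather than XmlPullParser. The class name EntityCheck and the helper isWellFormed are made up for this illustration; the point is only that a predefined entity or a numeric character reference parses, while an undeclared HTML entity such as &middot; does not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of the question's code.
class EntityCheck {

    /** Returns true if the JDK XML parser accepts the string as well-formed. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default handler so fatal errors are not printed to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { }
                public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Any parse failure (including an undeclared entity) ends up here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<t>a &amp; b &lt; c</t>")); // true
        System.out.println(isWellFormed("<t>a &#183; b</t>"));       // true
        System.out.println(isWellFormed("<t>a &middot; b</t>"));     // false
    }
}
```

The numeric form &#183; is accepted because character references need no declaration, which is why rewriting &middot; as &#183; is one way to keep the character.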
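Update
Simply use a regex to strip all HTML entities from the XML:
XMLString = XMLString.replaceAll("(&[^\\s]+?;)", "");
This replaces &middot; with "" (an empty string). Note that it also removes the predefined entities such as &amp;, because the pattern matches any &...; sequence.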
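If the predefined entities should survive, one possible variation is a negative lookahead that skips the five XML names. This is only a sketch: the class name EntityStripper and the exact pattern are my own assumptions, not part of the answer above.

```java
// Hypothetical helper illustrating a narrower version of the regex above.
class EntityStripper {

    // Matches &name; unless name is one of XML's five predefined entities.
    // Numeric references such as &#183; are not matched because '#' is not a word character.
    private static final String NON_XML_ENTITY =
            "&(?!(?:lt|gt|amp|quot|apos);)\\w+;";

    /** Removes named entities that XML does not predefine. */
    static String strip(String xml) {
        return xml.replaceAll(NON_XML_ENTITY, "");
    }

    public static void main(String[] args) {
        // &middot; is removed, &amp; is kept.
        System.out.println(strip("<t>A &middot; B &amp; C</t>"));
    }
}
```

The stripped character is lost, so this fits feeds where the entity is only decoration, as with the separator in the title here.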
Answer 2 (score: 1)
Maybe you can do this:
parser.setInput(...);
parser.defineEntityReplacementText("middot", "•");
Since that does not work with your implementation:
Use the HTML unescaping from Apache commons-lang, since these appear to be HTML named entities:
String xml = "<foo>Hello &middot; World!</foo>";
xml = StringEscapeUtils.unescapeHtml(xml);
Question from the comments:
To replace all entities indiscriminately:
String xml = "<...";
// Place all entities like "&middot;" in square brackets: "[middot]":
xml = xml.replaceAll("\\&(\\w+);", "[$1]");
// But not for the xml entities:
xml = xml.replaceAll("\\[(lt|gt|amp|quot|apos)\\]", "&$1;");
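A different approach, if the characters themselves should be kept, is to rewrite known HTML entity names as numeric character references before parsing. The sketch below assumes a small hand-written lookup table; the class name NamedToNumeric and the three sample entries are illustrative only, and a real feed may need more names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the entity table is deliberately tiny.
class NamedToNumeric {

    private static final Pattern NAMED = Pattern.compile("&(\\w+);");

    // Assumed sample entries: HTML entity name -> Unicode code point.
    private static final Map<String, Integer> CODE_POINTS = new HashMap<>();
    static {
        CODE_POINTS.put("middot", 0xB7);
        CODE_POINTS.put("nbsp", 0xA0);
        CODE_POINTS.put("copy", 0xA9);
    }

    /** Rewrites known named entities as &#NNN; and leaves everything else alone. */
    static String convert(String xml) {
        Matcher m = NAMED.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = CODE_POINTS.get(m.group(1));
            // Names missing from the table (including lt, gt, amp, quot, apos) stay as they are.
            String replacement = (codePoint != null) ? "&#" + codePoint + ";" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("<t>A &middot; B &amp; C</t>"));
        // prints <t>A &#183; B &amp; C</t>
    }
}
```

Entity names that are not in the table are passed through unchanged, so an unknown HTML entity would still make the parser fail; that is a limitation of this sketch rather than of the technique.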