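Invalid byte error when trying to parse XML

Asked: 2013-12-08 17:08:53

Tags: java xml rss

I am trying to read and parse the RSS feed of NASA's Image of the Day. The code is below. It throws the following exception: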

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.arrangeCapacity(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipString(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at Start.processFeed(Start.java:30)
at Loader.main(Loader.java:12)

What am I doing wrong?

P.S. Of course I have another class that contains the main method :)

Thanks in advance.

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;


public class Start extends DefaultHandler {

    private String url = "http://www.nasa.gov/rss/dyn/image_of_the_day.rss";
    private boolean inUrl = false;
    private boolean inTitle = false;
    private boolean inDescription = false;
    private boolean inItem = false;
    private boolean inDate = false;

    public void processFeed() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            XMLReader reader = parser.getXMLReader();
            reader.setContentHandler(this);
            InputStream inputStream = new URL(url).openStream();
            reader.parse(new InputSource(inputStream));
        } catch (Exception e) {
            e.printStackTrace();
        }
    } // processFeed


    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {

        if (localName.startsWith("item")) {
            inItem = true;
        } else if (inItem) {
            // Exactly one of these flags is set for the element just opened.
            inTitle = localName.equals("title");
            inDescription = localName.equals("description");
            inDate = localName.equals("pubDate");
        }
    }


    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Copy only the reported slice of the parser's buffer.
        String chars = new String(ch, start, length);

        if (inTitle) { System.out.println(chars); }
        if (inDescription) { System.out.println(chars); }
        if (inDate) { System.out.println(chars); }
    }

}

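1 Answer:

Answer 0 (score: 1)

The response entity is gzip-encoded (so it is compressed)! The parser is reading compressed bytes as if they were UTF-8 text, which is why it reports an invalid byte. You can wrap the input stream in a GZIPInputStream: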

InputStream inputStream = new GZIPInputStream(new URL(url).openStream());
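To see the effect without touching the network, here is a minimal sketch that assumes an in-memory byte array can stand in for the gzip-encoded response body. The class name GzipXmlDemo and its helpers gzip and parses are made up for this illustration; only the JDK classes are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class GzipXmlDemo {

    /** Compresses text with gzip; stands in for a gzip-encoded HTTP body (assumption). */
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Returns true if the SAX parser reads the stream to the end without an error. */
    public static boolean parses(InputStream in) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(in), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Covers both SAXException and IOException (e.g. malformed byte sequences).
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("<rss><channel><title>demo</title></channel></rss>");

        // Raw compressed bytes handed straight to the parser.
        System.out.println("raw:     " + parses(new ByteArrayInputStream(body)));

        // Same bytes, decompressed on the fly before the parser sees them.
        System.out.println("wrapped: "
                + parses(new GZIPInputStream(new ByteArrayInputStream(body))));
    }
}
```

The raw bytes should be rejected, since a gzip stream starts with the bytes 0x1f 0x8b and 0x8b is not a valid UTF-8 lead byte, while the wrapped stream should parse normally.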

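You should read the URL the "long way", through a URLConnection, so that you have more control over the connection and can test whether the content is compressed.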

URL url = new URL(urlString);
HttpURLConnection con = (HttpURLConnection) url.openConnection();
// we're not really connected now. Just the connection object has been created
// here you can set additional request properties (e.g. request headers)
con.connect();
// now we are connected!
if (con.getResponseCode() == HttpURLConnection.HTTP_OK) {
    try (InputStream entityStream = con.getInputStream()) {
        InputStream is;
        if ("gzip".equals(con.getContentEncoding())) {
            is = new GZIPInputStream(entityStream); // wrap
        } else {
            is = entityStream;
        }

        reader.parse(new InputSource(is));
    }
} else {
    // handle HTTP response code != OK
}
con.disconnect();
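If the Content-Encoding header cannot be relied on, one possible alternative (not part of the answer above) is to peek at the first two bytes of the stream and look for the gzip magic number 0x1f 0x8b. The sketch below is only an illustration under that assumption; GzipSniffer and maybeDecompress are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

public class GzipSniffer {

    /**
     * Returns a stream that yields decompressed data if the input starts with
     * the gzip magic bytes, and the original data otherwise.
     */
    public static InputStream maybeDecompress(InputStream raw) throws IOException {
        // Room to push back the two bytes we peek at.
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];

        // read() may return fewer bytes than requested, so loop until 2 or EOF.
        int n = 0;
        while (n < 2) {
            int r = in.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        if (n > 0) {
            in.unread(head, 0, n); // put the peeked bytes back
        }

        boolean looksGzipped = n == 2
                && (head[0] & 0xff) == 0x1f
                && (head[1] & 0xff) == 0x8b;
        return looksGzipped ? new GZIPInputStream(in) : in;
    }
}
```

With this helper the parse call would become reader.parse(new InputSource(GzipSniffer.maybeDecompress(entityStream))). Checking the header as shown in the answer remains the more direct approach when the server sets it correctly.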