Parsing RSS from a URL gives me "Invalid byte 2 of 2-byte UTF-8 sequence"

Asked: 2016-12-15 17:21:39

Tags: java xml utf-8 rss

I am trying to process RSS content and convert it to a JSON object using Java. The sources I am using are:

http://rss.uol.com.br/feed/economia.xml
http://g1.globo.com/dynamo/pr/parana/rss2.xml

First, I tried something like this:

DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
DocumentBuilder b = f.newDocumentBuilder();
Document doc = b.parse(myUrl);
//working with doc variable...

The second URL works fine (I think because it has <?xml version="1.0" encoding="utf-8"?> at the top), but the first one fails with:

Invalid byte 2 of 2-byte UTF-8 sequence

So I tried something like this instead:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);

InputStream is = new ByteArrayInputStream(myUrl.getBytes());
Reader reader = new InputStreamReader(is, "UTF-8");
InputSource io = new InputSource(reader);

io.setEncoding("UTF-8");
Document doc = dbf.newDocumentBuilder().parse(io);
//working with doc variable...

But now the second URL gives me this error:

Content is not allowed in prolog

while the first one still gives me the same error as before.

How can I read an RSS file from its URL without getting any charset errors?

1 Answer:

Answer 0: (score: 0)

This worked for me (Java SE 8):

String uri="http://rss.uol.com.br/feed/economia.xml";   
URL myUrl=new URL(uri);

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);

// If you do not need a proxy:
// InputStream is = myUrl.openStream();

// I needed a proxy:
java.net.Proxy pxy = new java.net.Proxy(
        java.net.Proxy.Type.HTTP,
        new java.net.InetSocketAddress("apoderado-externo", 8080));
URLConnection urlConn = myUrl.openConnection(pxy);
urlConn.connect();
InputStream is = urlConn.getInputStream();


//Prepare to parse the stuff 
Reader reader = new InputStreamReader(is, "UTF-8");
InputSource io = new InputSource(reader);

io.setEncoding("UTF-8");
Document doc = dbf.newDocumentBuilder().parse(io);
//working with doc variable...

I think that not only does the XML not declare an encoding, but the HTTP 'Content-Type' header also seems to be missing. Maybe there is no BOM either...
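
If the feed does send a charset in its Content-Type header, one option is to use that charset rather than hard-coding UTF-8, and fall back to a default only when the header is missing. Below is a minimal sketch of that idea, not a definitive implementation. The class name RssCharsetSketch, its method names, and the choice of fallback are assumptions made for illustration, and I have not checked what these particular feeds actually send. Note also that the question's second attempt builds the stream from myUrl.getBytes(), which appears to be the bytes of the URL text itself rather than the downloaded document, so the parser sees "http://..." instead of XML; that would explain "Content is not allowed in prolog".

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class RssCharsetSketch {

    // Pulls the charset parameter out of a Content-Type value such as
    // "text/xml; charset=ISO-8859-1". Returns the fallback when the header
    // is null, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes the stream with the given charset and parses the result.
    // Because the parser receives a Reader, the charset chosen here decides
    // how the bytes are decoded.
    static Document parse(InputStream in, Charset charset) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Reader reader = new InputStreamReader(in, charset);
        return dbf.newDocumentBuilder().parse(new InputSource(reader));
    }

    // Downloads the feed and decodes it with the charset announced by the
    // server, or with the fallback if none is announced.
    static Document fetch(String uri, Charset fallback) throws Exception {
        URLConnection conn = new URL(uri).openConnection();
        Charset cs = charsetFromContentType(conn.getContentType(), fallback);
        try (InputStream in = conn.getInputStream()) {
            return parse(in, cs);
        }
    }

    public static void main(String[] args) throws Exception {
        // Offline demo: Latin-1 bytes with no XML declaration.
        byte[] latin1 = "<rss><title>A\u00e7\u00e3o</title></rss>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Charset cs = charsetFromContentType("text/xml; charset=ISO-8859-1",
                StandardCharsets.UTF_8);
        Document doc = parse(new ByteArrayInputStream(latin1), cs);
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

With a real feed you would call something like RssCharsetSketch.fetch(uri, StandardCharsets.ISO_8859_1). Which fallback is appropriate depends on the feed; ISO-8859-1 here is only a guess for a document that is not valid UTF-8 and declares nothing.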