I am trying to process RSS content and convert it to a JSON object using Java. The feeds I am using are:
http://rss.uol.com.br/feed/economia.xml and http://g1.globo.com/dynamo/pr/parana/rss2.xml
First, I tried something like this:
DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
DocumentBuilder b = f.newDocumentBuilder();
Document doc = b.parse(myUrl);
//working with doc variable...
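The second URL works fine (I think because it starts with <?xml version="1.0" encoding="utf-8"?>), but the first one fails with:
Invalid byte 2 of 2-byte UTF-8 sequence.
So I tried something like this: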
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
InputStream is = new ByteArrayInputStream(myUrl.getBytes());
Reader reader = new InputStreamReader(is, "UTF-8");
InputSource io = new InputSource(reader);
io.setEncoding("UTF-8");
Document doc = dbf.newDocumentBuilder().parse(io);
//working with doc variable...
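But now the second URL gives me this error:
Content is not allowed in prolog.
The first one still gives me the same error as before.
How can I read an RSS file from its URL without getting any charset errors?
Answer 0 (score: 0)
This worked for me (Java SE 8):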
String uri="http://rss.uol.com.br/feed/economia.xml";
URL myUrl=new URL(uri);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
//If you do not need a proxy:
//InputStream is=myUrl.openStream();
//I needed a proxy:
java.net.Proxy pxy =new java.net.Proxy( java.net.Proxy.Type.HTTP,new java.net.InetSocketAddress("apoderado-externo",8080));
URLConnection urlConn=myUrl.openConnection(pxy);
urlConn.connect();
InputStream is =urlConn.getInputStream();
//Prepare to parse the stuff
Reader reader = new InputStreamReader(is, "UTF-8");
InputSource io = new InputSource(reader);
io.setEncoding("UTF-8");
Document doc = dbf.newDocumentBuilder().parse(io);
//working with doc variable...
I think that not only does the XML not declare an encoding, but the HTTP 'Content-Type' header also appears to be missing. Perhaps there is no BOM either...
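If the feed really declares no encoding anywhere, the charset has to be chosen by the caller. Below is a minimal sketch of that idea, not a definitive implementation. The class name `FeedCharsetSketch` and the helpers `charsetFromContentType` and `parse` are hypothetical names introduced here, and ISO-8859-1 is only an assumed example of a non-UTF-8 encoding; nothing above confirms what the feed actually uses. The sketch relies on the fact that when an `InputSource` wraps a `Reader`, the parser reads the already-decoded characters and disregards any encoding declaration in the document. As a side note, the "Content is not allowed in prolog" error in the question is consistent with `myUrl.getBytes()` handing the parser the bytes of the URL text itself rather than the feed content, assuming `myUrl` was a `String` there.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class FeedCharsetSketch {

    // Hypothetical helper: pull the charset parameter out of a Content-Type
    // header value such as "text/xml; charset=ISO-8859-1". Returns the
    // fallback when the header is missing or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Hypothetical helper: decode the bytes with the given charset first,
    // then let the XML parser read the resulting characters.
    static Document parse(InputStream in, Charset charset) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Reader reader = new InputStreamReader(in, charset);
        return dbf.newDocumentBuilder().parse(new InputSource(reader));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a feed body: accented letters stored as single
        // ISO-8859-1 bytes, with no XML encoding declaration.
        String xml = "<rss><channel><title>Infla\u00e7\u00e3o</title></channel></rss>";
        byte[] body = xml.getBytes(StandardCharsets.ISO_8859_1);

        // Stand-in for a header value; assumed, not taken from the real feed.
        Charset cs = charsetFromContentType("text/xml; charset=ISO-8859-1",
                                            StandardCharsets.UTF_8);

        Document doc = parse(new ByteArrayInputStream(body), cs);
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

With a live connection, the header value could come from `urlConn.getContentType()` and the stream from `urlConn.getInputStream()`. If the header is absent, as it seems to be here, the fallback charset is whatever you decide to try.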