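Getting strange characters when trying to read a UTF-8 document from a URL

Time: 2013-11-22 15:44:00

Tags: java file-io

When I try to read the following URL and store its content in a local file: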

private void testStoreFeedToLocalFile() throws IOException{
SyndFeed feed = null;
InputStream is = null;      
try {        
    Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy_url", proxy_port_number));
    URLConnection openConnection = new URL("http://www.deutschlandradio.de/podcast-bildung-und-wissenschaft.419.de.podcast").openConnection(proxy);
    HttpURLConnection httpOpenConnection = (HttpURLConnection)openConnection;
    if(httpOpenConnection.getResponseCode() >= 400){
        is = httpOpenConnection.getErrorStream();
    } else {
        is = httpOpenConnection.getInputStream();
    }

    Reader inReader = new InputStreamReader(is, "UTF-8");
    BufferedReader in = new BufferedReader(inReader);

    BufferedWriter writer = new BufferedWriter
            (new OutputStreamWriter(new FileOutputStream("C:/feed.xml"), "UTF-8"));  

    String feedText = null;
    while ((feedText = in.readLine()) != null) {
        // Keep in mind that readLine() strips the newline characters
        writer.write(feedText + "\n");
        System.out.println(feedText);
    }
    writer.close();

} catch(Exception e) {
    System.out.println("\n ++++++++++++++ ERROR testStoreFeedToLocalFile ++++++++++++++ \n");
    e.printStackTrace();
    System.out.println("\n \n");    
} finally {
    if(is!=null){
        is.close();
    }
}   

}

I get a bunch of strange characters (?? 9j?n ^ ???? P ?? etc.) both in the console and in the feed.xml file that is created. Any idea how to fix this?

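1 answer:

Answer 0: (score: 0)

Why do you feel the need to convert the stream to characters at all? Why not pass the bytes straight through with streams instead of a Reader/Writer?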

// FileOutputStream writes raw bytes, so it takes no charset argument.
OutputStream os = new FileOutputStream("C:/feed.xml");
byte[] buffer = new byte[4096];
int len;
// read() returns -1 at end of stream.
while ((len = is.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
os.flush();
os.close();

Edit:

The input stream is compressed:

~/junk $ curl -v -O http://www.deutschlandradio.de/podcast-bildung-und-wissenschaft.419.de.podcast
* About to connect() to www.deutschlandradio.de port 80 (#0)
*   Trying 217.69.91.96... connected
* Connected to www.deutschlandradio.de (217.69.91.96) port 80 (#0)
> GET /podcast-bildung-und-wissenschaft.419.de.podcast HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.12.9.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: www.deutschlandradio.de
> Accept: */*
> 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0< HTTP/1.1 200 OK
< Server: Apache
< P3P: CP="NOI NID ADMa OUR IND UNI COM NAV"
< Expires: Fri, 22 Nov 2013 15:48:50 GMT
< Cache-Control: public, max-age=60, pre-check=60, no-transform
< Pragma:
< Last-modified: Fri, 22 Nov 2013 15:47:51 GMT
< Content-Encoding: gzip
< X-DW-Server: www05
< Content-Type: application/xml; charset=utf-8
< Content-Length: 5111
< Date: Fri, 22 Nov 2013 15:53:18 GMT
< X-Varnish: 1993556337 1993540162
< Age: 327
< Via: 1.1 varnish
< Connection: keep-alive

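The response carries Content-Encoding: gzip, so the bytes you are reading are compressed. You need to detect that and handle it properly: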

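// Decompress when the server reports a gzip-encoded body.
if ("gzip".equalsIgnoreCase(httpOpenConnection.getContentEncoding())) {
    is = new GZIPInputStream(is); // java.util.zip.GZIPInputStream
}
// FileOutputStream writes raw bytes, so it takes no charset argument.
OutputStream os = new FileOutputStream("C:/feed.xml");
byte[] buffer = new byte[4096];
int len;
// read() returns -1 at end of stream.
while ((len = is.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
os.flush();
os.close();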
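
For anyone who wants to try the gzip handling without the proxy or the network, here is a minimal self-contained sketch. The class name GzipAwareCopy and the helpers decode and copy are made up for illustration, and the compressed "response body" is simulated in memory rather than fetched from the feed URL.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original question or answer.
class GzipAwareCopy {

    // Wraps the stream in a GZIPInputStream when the Content-Encoding value is "gzip";
    // otherwise returns the stream unchanged.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Copies bytes unchanged until read() signals end of stream with -1.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[4096];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Sample UTF-8 payload with a non-ASCII character (u-umlaut).
        byte[] original = "<title>Bildung f\u00fcr alle</title>".getBytes(StandardCharsets.UTF_8);

        // Simulate a gzip-compressed response body in memory.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(original);
        }

        // Decode and copy, as one would with the connection's input stream.
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream in = decode(new ByteArrayInputStream(compressed.toByteArray()), "gzip")) {
            copy(in, result);
        }

        // The decompressed bytes should match the original payload exactly.
        System.out.println(Arrays.equals(original, result.toByteArray()));
    }
}
```

With the real connection, the same helpers would presumably be called as copy(decode(is, httpOpenConnection.getContentEncoding()), os). Because the bytes are never turned into characters, the UTF-8 content reaches the file untouched once it has been decompressed.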