Question

I am trying to read from a URL and then print the result.

BufferedReader in = new BufferedReader(
     new InputStreamReader(new URL("http://somesite.com/").openStream(), "UTF-8"));
String s = "";
while ((s=in.readLine())!=null) System.out.println(s);
in.close();

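Most of the time this works fine and prints the page source. On certain websites, however, it prints gibberish such as symbols and other unusual characters instead of the source code.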

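Is there some property that varies from site to site and affects how the response has to be read? The page loads fine in Firefox, and I can view its source there without any problem. If Firefox can get at the source, I should be able to as well; I'm just not sure why it isn't working...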

Edit: Added "UTF-8" to the InputStreamReader. All the strange characters are now question marks... still not working...

Answer 1

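After a lot of searching, I found the answer. The XML was being read as gibberish because the response was gzip-compressed. The way to read it is to wrap the stream in a GZIPInputStream, which decompresses the bytes before they are decoded as text.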

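// Requires: java.io.InputStreamReader, java.net.HttpURLConnection, java.util.zip.GZIPInputStream
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("Accept-Encoding", "gzip");
// Decompress first, then decode the bytes as text (UTF-8 assumed, as in the question)
InputStreamReader in = new InputStreamReader(
        new GZIPInputStream(connection.getInputStream()), "UTF-8");
StringBuilder str = new StringBuilder();
while (true) {
    int ch = in.read();
    if (ch == -1) {
        break;
    }
    str.append((char) ch);
}
in.close();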
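
Not every server compresses its responses, so wrapping every stream in a GZIPInputStream would fail on plain ones. A minimal sketch of a reader that only decompresses when the response is labelled as gzip follows; the class name GzipAwareReader and its methods are hypothetical, and the demo feeds it in-memory data rather than a real connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the original answer.
class GzipAwareReader {
    // Reads the whole body as text, decompressing first only if the
    // Content-Encoding value passed in is "gzip" (case-insensitive).
    static String readBody(InputStream body, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body;
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Demo-only helper: gzip-compresses a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = gzip("<xml>hello</xml>");
        System.out.println(readBody(
                new ByteArrayInputStream(compressed), "gzip", StandardCharsets.UTF_8));
    }
}
```

With a real connection, the arguments would presumably be connection.getInputStream() and connection.getContentEncoding(), which returns null when the header is absent.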

Answer 2

You are probably running into a character encoding problem.

The response should contain an HTTP header like the following:

Content-Type: text/html; charset=UTF-8
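
To act on that header, the charset parameter can be pulled out of the Content-Type value and used to build the InputStreamReader instead of hard-coding "UTF-8". The following is only a rough sketch: ContentTypeCharset and charsetOf are hypothetical names, and the parsing is deliberately simple rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class ContentTypeCharset {
    // Returns the charset named in a Content-Type header value, or the
    // fallback if the header is missing, has no charset parameter, or
    // names a charset this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

With a URLConnection, the header value is available from connection.getContentType(). If the server sends no charset, the right fallback depends on the site, so the second argument is left to the caller.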

Answer 3

Try using telnet to diagnose what is actually coming over the wire. It may not be text data at all. For example, what happens when you do this?

telnet somesite.com 80
GET / HTTP/1.0
Host: somesite.com

(You need to press Enter twice after the last line.)

This should let you see the headers and the content, and should give you a better clue.
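
If telnet is not available, the same check can be approximated from Java with a plain socket. This is a sketch under assumptions: RawHttpProbe, buildRequest, and printHeaders are hypothetical names, and it only handles plain HTTP on port 80.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
class RawHttpProbe {
    // Builds the same request that is typed into telnet; the final
    // empty line marks the end of the request headers.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n";
    }

    // Sends the request over a plain socket and prints the status line
    // and response headers, stopping at the blank line before the body.
    static void printHeaders(String host) throws IOException {
        try (Socket socket = new Socket(host, 80)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, "/").getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                System.out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            printHeaders(args[0]);   // e.g. java RawHttpProbe somesite.com
        } else {
            // Without a host argument, just show the request that would be sent
            System.out.print(buildRequest("somesite.com", "/"));
        }
    }
}
```

Headers such as Content-Encoding and Content-Type in the output are the ones to look at first, since they say whether the body is compressed and how it is encoded.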

Answer 4

I had the same problem until I configured the HttpURLConnection with setChunkedStreamingMode.

// serverAddress is a java.net.URL
HttpURLConnection connection = (HttpURLConnection) serverAddress.openConnection();
connection.setRequestMethod("GET");
connection.setDoOutput(true);
connection.setReadTimeout(2000);
connection.setChunkedStreamingMode(0); // 0 selects the default chunk length

connection.connect();

BufferedReader rd = new BufferedReader(new InputStreamReader(connection.getInputStream()));

// Collect the response line by line (line breaks are not preserved)
StringBuilder sb = new StringBuilder();
String line;

while ((line = rd.readLine()) != null)
{
    sb.append(line);
}
rd.close();

System.out.println(sb.toString());

Trying to read from a URL (Java) produces gibberish in some cases
