Question

When I do the following:

try {
    URL url = new URL(urlAsString);
    //using proxy may increase latency
    HttpURLConnection hConn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
    // force no follow
    hConn.setInstanceFollowRedirects(false);
    // the program doesn't care what the content actually is       
    hConn.setRequestMethod("HEAD");
    // default is 0 => infinity waiting
    hConn.setConnectTimeout(timeout);
    hConn.setReadTimeout(timeout);
    hConn.connect();
    int responseCode = hConn.getResponseCode();
    hConn.getInputStream().close();
    if (responseCode == HttpURLConnection.HTTP_OK)
        return urlAsString;

    String loc = hConn.getHeaderField("Location");
    if (responseCode == HttpURLConnection.HTTP_MOVED_PERM && loc != null)
        return loc.replaceAll(" ", "+");

} catch (Exception ex) {
}
return "";

for the URL http://bit.ly/gek1qK, I get

http://blog.tweetsmarter.com/twitter-downtime/twitter-redesignsâthen-everything-breaks/

which is wrong. Firefox resolves it to

http://blog.tweetsmarter.com/twitter-downtime/twitter-redesigns%E2%80%94then-everything-breaks/

What is wrong with the code?

Answer 1

According to RFC 2616, section 2.2, HTTP header values should normally be encoded using ISO-8859-1.

Here, bit.ly is sending a malformed response: the Location: header is encoded using UTF-8, so the em-dash character is represented by three individual bytes (0xe2, 0x80, 0x94).

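HttpURLConnection decodes the bytes using ISO-8859-1, so they become three characters (â and two undefined characters), ~~but it looks as if it re-encodes them using UTF-8 before producing the URL encoding~~ (which would yield 2 bytes per character, because all three characters have values >= 0x80).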
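To make the decoding step concrete, here is a minimal sketch that simulates it without any network access. The class name EmDashDecodeDemo and the helper codePoints are made up for illustration; the sketch only assumes, as described above, that the three UTF-8 bytes of the em-dash end up being decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class EmDashDecodeDemo {

    // Hypothetical helper: lists the chars of s as "U+XXXX", separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) ch));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The em-dash (U+2014) as it travels on the wire in UTF-8: 0xE2 0x80 0x94.
        byte[] utf8 = "\u2014".getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes as ISO-8859-1 maps each byte to one char.
        String misdecoded = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);             // prints 3
        System.out.println(codePoints(misdecoded));  // prints U+00E2 U+0080 U+0094
    }
}
```

U+00E2 is the visible â; the other two are non-printing characters, which is why the broken URL seems to contain only one stray character.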

Firefox most likely treats the data as ISO-8859-1 as well; the problem then cancels itself out when URL encoding is applied later.

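You can do the same by URL-encoding the value returned by getHeaderField(). Since the Unicode range U+0080 to U+00FF is identical to the ISO-8859-1 byte range 0x80-0xFF, the non-ASCII characters can be encoded by casting them to their int values: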

/**
 * Takes a URI that was decoded as ISO-8859-1 and applies percent-encoding
 * to non-ASCII characters. Workaround for broken origin servers that send
 * UTF-8 in the Location: header.
 */
static String encodeUriFromHeader(String uri) {
    StringBuilder sb = new StringBuilder();

    for(char ch : uri.toCharArray()) {
        if(ch < (char)128) {
            sb.append(ch);
        } else {
            // this is ONLY valid if the uri was decoded using ISO-8859-1
            sb.append(String.format("%%%02X", (int)ch));
        }
    }

    return sb.toString();
}
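
As a quick check, the following self-contained sketch applies the same idea to a simulated header value. The class name LocationFixDemo is hypothetical, and the "header" is produced locally by decoding the UTF-8 bytes of the real target URL as ISO-8859-1, which is an assumption about what getHeaderField() returns in this situation.

```java
import java.nio.charset.StandardCharsets;

public class LocationFixDemo {

    // Same approach as the method above: percent-encode every non-ASCII char
    // by using its char value as the byte value (valid only after an
    // ISO-8859-1 decode).
    static String encodeUriFromHeader(String uri) {
        StringBuilder sb = new StringBuilder();
        for (char ch : uri.toCharArray()) {
            if (ch < 128) {
                sb.append(ch);
            } else {
                sb.append(String.format("%%%02X", (int) ch));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String real = "http://blog.tweetsmarter.com/twitter-downtime/"
                + "twitter-redesigns\u2014then-everything-breaks/";

        // Simulate the mis-decoded header: UTF-8 bytes read as ISO-8859-1.
        String fromHeader = new String(
                real.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);

        // Prints the URL with %E2%80%94 in place of the em-dash.
        System.out.println(encodeUriFromHeader(fromHeader));
    }
}
```

The output matches what Firefox shows for this link. The workaround should not be applied to strings that were decoded with any other charset.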

Answer 2

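Nothing is wrong. The difference is that the em-dash is represented differently in different encodings. So if Firefox uses a different encoding than your program does, you will see different characters.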

In your case both are correct. It is just a matter of encoding. In Java you are using UTF-8, which is the World Wide Web Consortium Recommendation, whereas what you see in Firefox seems to be ISO-8859.

If you want to produce the same result as Firefox in Java, try the following:

System.out.print(URLEncoder.encode(loc.replace(" ", "+"), "ISO-8859-1"));

It will print what you see in Firefox. (Obviously, it will also encode / and :, but this is just to demonstrate the point.)
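
To illustrate that caveat, here is a small sketch of what URLEncoder does to a whole URL string. The class name Latin1UrlEncodeDemo, the helper encodeLatin1, and the host x.test are placeholders chosen for the example, and the input again assumes a value that was decoded as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class Latin1UrlEncodeDemo {

    // Hypothetical wrapper around URLEncoder.encode with ISO-8859-1.
    static String encodeLatin1(String s) {
        try {
            return URLEncoder.encode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The three chars produced by decoding an em-dash's UTF-8 bytes as ISO-8859-1.
        String loc = "http://x.test/a\u00E2\u0080\u0094b";

        // Prints http%3A%2F%2Fx.test%2Fa%E2%80%94b
        System.out.println(encodeLatin1(loc));
    }
}
```

The em-dash bytes come out as %E2%80%94, as in Firefox, but the scheme and path separators are escaped too, so the result is not usable as a URL without further handling.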
