Question

I have a database that references the following URL:

http://en.wikipedia.org/wiki/Herbert_Gr%F6nemeyer

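However, this appears to be badly URL-encoded, which causes problems with both HttpUtility.UrlDecode (it gives me garbage) and Uri.UnescapeDataString (it throws a UriFormatException). My browser passes the path on to Wikipedia as follows (I assume the %F6 is encoded by the browser):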

GET /wiki/Herbert_Gr%F6nemeyer HTTP/1.1

Wikipedia recognises this and answers with a 301 redirect to:

Location: http://en.wikipedia.org/wiki/Herbert_Gr%C3%B6nemeyer

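What is going on here? Does Wikipedia use some additional proprietary encoding?

Edit: I have a local copy of Wikipedia that I am trying to cross-reference with this URL. The articles are indexed by title, which in this case would be "Herbert Grönemeyer". Can anyone suggest how I can get from "Herbert_Gr%F6nemeyer" to "Herbert Grönemeyer" in code? Obviously the underscore is not the problem.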

Answer 1

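%C3%B6 is the correct UTF-8 encoding of ö (o-umlaut). I assume %F6 is a byte-for-byte copy of the same character's byte value in some local single-byte encoding (for example, code page 1252).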
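To make the difference concrete, here is a minimal sketch that percent-encodes ö under both encodings. The class and method names (`EncodingDemo`, `PercentEncodeAll`) are made up for illustration. ISO-8859-1 stands in for code page 1252 here, on the assumption that it is available on every .NET runtime without extra setup; the two agree on the byte value for ö.

```csharp
using System;
using System.Linq;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: percent-encodes every byte of `text` in the given encoding.
    public static string PercentEncodeAll(string text, Encoding encoding)
    {
        return string.Concat(encoding.GetBytes(text).Select(b => "%" + b.ToString("X2")));
    }

    public static void Main()
    {
        // ISO-8859-1 maps U+00F6 to the same single byte as code page 1252.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // "\u00F6" is ö, written as an escape so the source file encoding does not matter.
        Console.WriteLine(PercentEncodeAll("\u00F6", Encoding.UTF8)); // %C3%B6
        Console.WriteLine(PercentEncodeAll("\u00F6", latin1));        // %F6
    }
}
```

So the URL in the database carries the single-byte form, while the redirect target carries the two-byte UTF-8 form of the same character.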

Answer 2

Here is some quick code I threw together to deal with this. Thanks to Josip for pointing me in the right direction:

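    // Requires: using System; using System.Globalization; using System.Text;
    //           using System.Text.RegularExpressions;
    private string UrlDecode(string input)
    {
        try
        {
            // Works when the escapes form valid UTF-8 (e.g. %C3%B6).
            return Uri.UnescapeDataString(input);
        }
        catch (UriFormatException)
        {
            // Reached when the escapes are not valid UTF-8 (e.g. a lone %F6),
            // relying on UnescapeDataString throwing as described in the question.
            // Fall back to treating each %XX escape as one Windows-1252 byte.
            var cp1252 = Encoding.GetEncoding(1252);

            // A single Regex.Replace pass, so decoded output is never re-scanned
            // (a decoded '%' cannot trigger a second round of decoding).
            // Lowercase hex digits are accepted as well.
            return Regex.Replace(input, "%[0-9A-Fa-f]{2}", match =>
            {
                byte b = byte.Parse(match.Value.Substring(1), NumberStyles.HexNumber);
                return cp1252.GetString(new[] { b });
            });
        }
    }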
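An alternative sketch that does not depend on an exception from Uri.UnescapeDataString: collect the raw bytes first, try a strict UTF-8 decode, and only fall back to a single-byte encoding if that fails. The names `WikiTitle` and `FromUrlSegment` are hypothetical, ISO-8859-1 is used as a stand-in for code page 1252 (they differ only in the 0x80-0x9F range), and the underscore-to-space step assumes the local index stores titles with spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WikiTitle
{
    // Strict UTF-8: throws DecoderFallbackException on invalid bytes instead of emitting U+FFFD.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Assumption: ISO-8859-1 as the legacy single-byte encoding.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    public static string FromUrlSegment(string segment)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < segment.Length; i++)
        {
            bool isEscape = segment[i] == '%'
                && i + 2 < segment.Length
                && Uri.IsHexDigit(segment[i + 1])
                && Uri.IsHexDigit(segment[i + 2]);

            if (isEscape)
            {
                // %XX becomes exactly one raw byte.
                bytes.Add(Convert.ToByte(segment.Substring(i + 1, 2), 16));
                i += 2;
            }
            else
            {
                // Literal characters are stored as UTF-8; keep surrogate pairs together.
                int length = char.IsSurrogatePair(segment, i) ? 2 : 1;
                bytes.AddRange(Encoding.UTF8.GetBytes(segment.Substring(i, length)));
                i += length - 1;
            }
        }

        byte[] raw = bytes.ToArray();
        string decoded;
        try
        {
            decoded = StrictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so interpret every byte as one legacy character.
            // This assumes the literal (unescaped) characters were plain ASCII.
            decoded = Latin1.GetString(raw);
        }

        // Wikipedia URLs use underscores where titles have spaces.
        return decoded.Replace('_', ' ');
    }

    public static void Main()
    {
        Console.WriteLine(FromUrlSegment("Herbert_Gr%F6nemeyer"));
        Console.WriteLine(FromUrlSegment("Herbert_Gr%C3%B6nemeyer"));
    }
}
```

Both calls should yield the same title, because 0xF6 cannot start a valid UTF-8 sequence and therefore forces the fallback. A short byte sequence that happens to be valid in both encodings would be read as UTF-8, so this is a heuristic rather than a guarantee.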

URL decoding confusion