Encoding problem with HttpWebResponse

Date: 2008-10-22 21:17:05

Tags: c# encoding

Here is the code snippet:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(request.RawUrl);
WebRequest.DefaultWebProxy = null;//Ensure that we will not loop by going again in the proxy
HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse();
string charSet = response.CharacterSet;
Encoding encoding;
if (String.IsNullOrEmpty(charSet))
    encoding = Encoding.Default;
else
    encoding = Encoding.GetEncoding(charSet);

StreamReader resStream = new StreamReader(response.GetResponseStream(), encoding);
return resStream.ReadToEnd();

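The problem is that when I test with http://www.google.fr

every "é" is displayed incorrectly. I tried changing ASCII to UTF8, but the characters are still wrong. I opened the HTML file in a browser and the browser displays the text correctly, so I'm fairly sure the problem is in the method I use to download the HTML file.

What should I change?

(Dead ImageShack link removed.)

Update 1: code and test file changed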

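7 answers:

Answer 0 (score: 29)

CharacterSet is "ISO-8859-1" by default when the server does not specify a charset in its Content-Type header (which is not the same thing as the "charset" meta tag in the HTML). I compare HttpWebResponse.CharacterSet with the charset attribute in the HTML. If they differ, I read the page again, this time using the charset specified in the HTML so that the correct encoding is applied.

See the code: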

    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(sURL);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);
    // read response
    using (StreamReader sr = 
           new StreamReader(objResponse.GetResponseStream(), encoding))
    {
        strWebPage = sr.ReadToEnd();
        // Close and clean up the StreamReader
        sr.Close();
    }

    // Check real charset meta-tag in HTML
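    int CharsetStart = strWebPage.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        // IndexOfAny returns -1 when no terminator follows; in that case
        // (or for an empty value) keep the header charset
        string RealCharset = CharsetEnd > CharsetStart
               ? strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart)
               : Charset;

        // real charset meta-tag in HTML differs from supplied server header???
        // (charset names are case-insensitive, so compare them that way)
        if (!String.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))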
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // read the web page again, but with correct encoding this time
            //   create request
            System.Net.WebRequest objRequest2 = System.Net.HttpWebRequest.Create(sURL);
            //   get response
            System.Net.HttpWebResponse objResponse2;
            objResponse2 = (System.Net.HttpWebResponse)objRequest2.GetResponse();
            //   read response
            using (StreamReader sr = 
              new StreamReader(objResponse2.GetResponseStream(), CorrectEncoding))
            {
                strWebPage = sr.ReadToEnd();
                // Close and clean up the StreamReader
                sr.Close();
            }
        }
    }
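The meta-tag scan can also be pulled out into a small helper so it can be checked without touching the network. This is only a sketch: the class name `MetaCharset`, the method name `Find`, and the regular expression are illustrative assumptions rather than part of the answer, and a plain text search like this can still be fooled by the string "charset=" appearing inside a script or a comment.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the original answer).
public static class MetaCharset
{
    // Matches charset=utf-8, charset="utf-8" and charset='utf-8', ignoring case.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_:.\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the first charset name declared in the markup, or null if none is found.
    public static string Find(string html)
    {
        if (String.IsNullOrEmpty(html)) return null;
        Match match = CharsetPattern.Match(html);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

For example, `MetaCharset.Find("<meta charset=\"utf-8\">")` yields `utf-8`, and a page with no declaration yields `null`, in which case the header charset can simply be kept.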

Answer 1 (score: 25)

First, a simpler way to write that code is to use StreamReader and ReadToEnd:

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create(myURL);
using (HttpWebResponse response = (HttpWebResponse)webRequest.GetResponse())
{
    using (Stream resStream = response.GetResponseStream())
    {
        StreamReader reader = new StreamReader(resStream, Encoding.???);
        return reader.ReadToEnd();
    }
}

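Then it is "just" a matter of finding the right encoding. How did you create the file? If it was in Notepad, then you probably want Encoding.Default, but that is obviously not portable, because it is the default encoding of *your* PC.

A well-run web server will indicate the encoding in the response headers. Having said that, in some cases the response headers claim one thing and the HTML claims another.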
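Because the header may be missing or may name a charset the runtime does not know, it can help to turn the charset label into an Encoding defensively. The sketch below assumes a hypothetical helper called `EncodingResolver.Resolve`; the fallback is whatever the caller chooses, and nothing here decides which of two conflicting declarations is right.

```csharp
using System;
using System.Text;

// Hypothetical helper (not from the original answer).
public static class EncodingResolver
{
    // Maps a charset label to an Encoding, returning the fallback when the label
    // is missing or not recognised by this runtime.
    public static Encoding Resolve(string charset, Encoding fallback)
    {
        if (String.IsNullOrWhiteSpace(charset)) return fallback;
        try
        {
            // Some servers quote the value, e.g. charset="utf-8".
            return Encoding.GetEncoding(charset.Trim().Trim('"', '\''));
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for unknown names.
            return fallback;
        }
    }
}
```

A call such as `EncodingResolver.Resolve(response.CharacterSet, Encoding.UTF8)` could then stand in for the `Encoding.???` placeholder, with UTF-8 being just one possible choice of fallback.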

Answer 2 (score: 15)

If you don't want to download the page twice, here is Alex's code, slightly modified using "How do I put a WebResponse into a memory stream?". This is the result:

public static string DownloadString(string address)
{
    string strWebPage = "";
    // create request
    System.Net.WebRequest objRequest = System.Net.HttpWebRequest.Create(address);
    // get response
    System.Net.HttpWebResponse objResponse;
    objResponse = (System.Net.HttpWebResponse)objRequest.GetResponse();
    // get correct charset and encoding from the server's header
    string Charset = objResponse.CharacterSet;
    Encoding encoding = Encoding.GetEncoding(Charset);

    // read response into memory stream
    MemoryStream memoryStream;
    using (Stream responseStream = objResponse.GetResponseStream())
    {
        memoryStream = new MemoryStream();

        byte[] buffer = new byte[1024];
        int byteCount;
        do
        {
            byteCount = responseStream.Read(buffer, 0, buffer.Length);
            memoryStream.Write(buffer, 0, byteCount);
        } while (byteCount > 0);
    }

    // set stream position to beginning
    memoryStream.Seek(0, SeekOrigin.Begin);

    StreamReader sr = new StreamReader(memoryStream, encoding);
    strWebPage = sr.ReadToEnd();

    // Check real charset meta-tag in HTML
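    int CharsetStart = strWebPage.IndexOf("charset=", StringComparison.OrdinalIgnoreCase);
    if (CharsetStart > 0)
    {
        CharsetStart += 8;
        int CharsetEnd = strWebPage.IndexOfAny(new[] { ' ', '\"', ';' }, CharsetStart);
        // IndexOfAny returns -1 when no terminator follows; in that case
        // (or for an empty value) keep the header charset
        string RealCharset = CharsetEnd > CharsetStart
               ? strWebPage.Substring(CharsetStart, CharsetEnd - CharsetStart)
               : Charset;

        // real charset meta-tag in HTML differs from supplied server header???
        // (charset names are case-insensitive, so compare them that way)
        if (!String.Equals(RealCharset, Charset, StringComparison.OrdinalIgnoreCase))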
        {
            // get correct encoding
            Encoding CorrectEncoding = Encoding.GetEncoding(RealCharset);

            // reset stream position to beginning
            memoryStream.Seek(0, SeekOrigin.Begin);

            // reread response stream with the correct encoding
            StreamReader sr2 = new StreamReader(memoryStream, CorrectEncoding);

            strWebPage = sr2.ReadToEnd();
            // Close and clean up the StreamReader
            sr2.Close();
        }
    }

    // dispose the first stream reader object
    sr.Close();

    return strWebPage;
}
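On .NET Framework 4 or later, the manual copy loop can probably be shortened with Stream.CopyTo. The following is a minimal sketch of the same buffer-once idea; the class name `BufferedDecode` is an illustrative assumption. Keeping the raw bytes means the page can be decoded a second time without another request.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper (not from the original answer).
public static class BufferedDecode
{
    // Copies the whole stream into memory so the bytes can be decoded more than once.
    public static byte[] ReadAllBytes(Stream source)
    {
        using (var buffer = new MemoryStream())
        {
            source.CopyTo(buffer);
            return buffer.ToArray();
        }
    }

    // Decodes the same buffered bytes with whichever encoding is currently believed correct.
    public static string Decode(byte[] body, Encoding encoding)
    {
        return encoding.GetString(body);
    }
}
```

With the bytes in hand, a first pass can use the header charset and, if the HTML declares something else, a second call to `Decode` with the other encoding replaces the first result.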

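Answer 3 (score: 3)

There are some good solutions here, but they all seem to parse the charset out of the content-type string by hand. Here is a solution that uses System.Net.Mime.ContentType, which should be more reliable and is shorter.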

 var client = new System.Net.WebClient();
 var data = client.DownloadData(url);
 var encoding = System.Text.Encoding.Default;
 var contentType = new System.Net.Mime.ContentType(client.ResponseHeaders[HttpResponseHeader.ContentType]);
 if (!String.IsNullOrEmpty(contentType.CharSet))
 {
      encoding = System.Text.Encoding.GetEncoding(contentType.CharSet);
 }
 string result = encoding.GetString(data);
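One thing to watch for: the Content-Type header can be absent, and the ContentType constructor rejects null, empty, or malformed input. The sketch below wraps the parsing in a guard; the class name `ContentTypeCharset` and the method name `FromHeader` are illustrative assumptions.

```csharp
using System;
using System.Net.Mime;

// Hypothetical helper (not from the original answer).
public static class ContentTypeCharset
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is absent, malformed, or has no charset.
    public static string FromHeader(string contentTypeHeader)
    {
        if (String.IsNullOrEmpty(contentTypeHeader)) return null;
        try
        {
            return new ContentType(contentTypeHeader).CharSet;
        }
        catch (FormatException)
        {
            // The header value could not be parsed as a content type.
            return null;
        }
    }
}
```

The result of `ContentTypeCharset.FromHeader(client.ResponseHeaders[HttpResponseHeader.ContentType])` can then be tested with String.IsNullOrEmpty, as the snippet above already does.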

Answer 4 (score: 1)

Here is code that downloads the page only once.

class BaseClass(object):

    @classmethod
    def load_data(cls):
        try:
            return some_external_load_function(cls.DATA_FILE_NAME)
        except AttributeError:
            raise NotImplementedError(
                'It seems you forgot to define the DATA_FILE_NAME attribute '
                'on you child class.')

class Child1(BaseClass):
    DATA_FILE_NAME = 'my_one_data_file.data'

class Child2(BaseClass):
    DATA_FILE_NAME = 'my_other_data_file.data'

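Answer 5 (score: 0)

I investigated the same problem with the help of Wireshark, a great protocol analyzer. I think the HttpWebResponse class has some design shortcomings. The whole message entity is in fact downloaded the first time you call the GetResponse() method of the HttpWebRequest class, but the framework has nowhere to keep that data, in the HttpWebResponse class or elsewhere, so you have to fetch the response stream a second time.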

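Answer 6 (score: 0)

There are still some problems when requesting the web page "www.google.fr" through WebRequest.

I checked the raw request and response with Fiddler. The problem comes from Google's server. The HTTP response header is set to charset=ISO-8859-1 and the text itself is encoded in ISO-8859-1, while the HTML says charset=UTF-8. This is inconsistent and leads to the encoding errors.

After many tests, I found a workaround. Just add: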

myHttpWebRequest.UserAgent = "Mozilla/5.0";

to your code, and the Google response will magically become UTF-8 throughout.
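To see why a mismatched declaration garbles "é", it may help to reproduce the effect offline. The sketch below encodes a string with one encoding and decodes the bytes with another, as a client would if it trusted the wrong charset. The class name `MojibakeDemo` is an illustrative assumption, and the example says nothing about what Google's servers actually send.

```csharp
using System;
using System.Text;

// Hypothetical demo (not from the original answer).
public static class MojibakeDemo
{
    // Encodes text with one encoding and decodes the bytes with another,
    // mimicking a client that trusts the wrong charset declaration.
    public static string Transcode(string text, Encoding actual, Encoding assumed)
    {
        return assumed.GetString(actual.GetBytes(text));
    }
}
```

With `Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");`, the call `MojibakeDemo.Transcode("\u00E9", Encoding.UTF8, latin1)` returns the two characters "Ã©", because the UTF-8 bytes 0xC3 0xA9 are each read as a separate ISO-8859-1 character. When both encodings match, the "é" comes back intact.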