Question

I need to download a web page, and I have the following code to determine its encoding:

                System.IO.StreamReader sr=null;

                mFrm.InfoShotcut("Henter webside....");
                if(response.ContentEncoding!=null && response.ContentEncoding!="")
                {
                    sr=new System.IO.StreamReader(srm,System.Text.Encoding.GetEncoding(response.ContentEncoding));
                }
                else
                {
                    //System.Windows.Forms.MessageBox.Show();
                    sr=new  System.IO.StreamReader(srm,System.Text.Encoding.GetEncoding(response.CharacterSet));
                }

                if(sr!=null)
                {
                    result=sr.ReadToEnd();

                     if(response.CharacterSet!=GetCharatset(result))
                    {
                        System.Text.Encoding CorrectEncoding=System.Text.Encoding.GetEncoding(GetCharatset(result));

                        HttpWebRequest client2=(HttpWebRequest)HttpWebRequest.Create(Helper.value1);

                        HttpWebResponse response2=(HttpWebResponse)client2.GetResponse();

                        System.IO.Stream srm2=response2.GetResponseStream();

                        sr=new System.IO.StreamReader(srm2,CorrectEncoding);

                        result=sr.ReadToEnd();
                    }
                }

                mFrm.InfoShotcut("Henter webside......");
            }
            catch (Exception ex)
            {
                // handle error
                MessageBox.Show( ex.Message );
            }

It works well, but now I have tried it on a site that declares that it uses

    <META http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

but the page is really UTF-8. How can I detect that, so that I can save the file with the correct encoding?

Answer 1

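First, the Content-Encoding header does not describe the character set being used. As the RFC says:

"Content codings are primarily used to allow a document to be compressed or otherwise usefully transformed without losing the identity of its underlying media type and without loss of information."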

The character set in use is described in the Content-Type header. For example:

Content-Type: text/html; charset=UTF-8

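Your code above, which uses the Content-Encoding header, cannot identify the character set correctly. You have to look at the Content-Type header, find the semicolon (if there is one), and then parse the charset parameter.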
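
As a rough illustration of that parsing step, here is a minimal sketch. The class and method names (`CharsetParser`, `GetCharset`) are made up for this example, and it assumes you already have the raw header value as a string (for instance from `response.ContentType`). It only handles the simple `name=value` parameter form.

```csharp
using System;

static class CharsetParser
{
    // Hypothetical helper: returns the charset parameter of a
    // Content-Type header value, or null if none is present.
    public static string GetCharset(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
            return null;

        // parts[0] is the media type (e.g. "text/html");
        // any parameters follow, separated by semicolons.
        string[] parts = contentType.Split(';');
        for (int i = 1; i < parts.Length; i++)
        {
            string param = parts[i].Trim();
            if (param.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                // Strip the parameter name and any surrounding quotes.
                string value = param.Substring("charset=".Length)
                                    .Trim()
                                    .Trim('"', '\'');
                return value.Length > 0 ? value : null;
            }
        }
        return null;
    }
}
```

Returning null when there is no charset parameter keeps "not declared" distinct from any particular encoding, so the caller can decide what to fall back to.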

And, as you have discovered, the character set can also be declared in an HTML META tag.
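
For the META case, something like the following could pull out the declared charset. This is only a sketch: `MetaCharset.Find` is a name invented here, the regular expression is a simplification rather than a real HTML parser, and it assumes the page has already been decoded well enough to be searched as text (the META tag itself is normally plain ASCII).

```csharp
using System.Text.RegularExpressions;

static class MetaCharset
{
    // Matches both <meta charset="..."> and the older
    // <meta http-equiv="Content-Type" content="...; charset=..."> form.
    static readonly Regex MetaRegex = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the charset named in the first
    // matching META tag, or null if no such tag is found.
    public static string Find(string html)
    {
        Match m = MetaRegex.Match(html ?? string.Empty);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```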

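Or there might be no character set declaration at all, in which case you have to default to something. My experience is that defaulting to UTF-8 is a good choice. It is not 100% reliable, but sites that omit the charset parameter from the Content-Type header usually seem to be UTF-8. I have also found that META tags, when they exist, are wrong almost half the time.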
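
A small sketch of that fallback, assuming the charset name (possibly null) has already been extracted from the header or the META tag. `EncodingResolver.Resolve` is a hypothetical name. `Encoding.GetEncoding` throws an `ArgumentException` for names it does not recognize, so that case also falls back to UTF-8 here.

```csharp
using System;
using System.Text;

static class EncodingResolver
{
    // Hypothetical helper: maps a declared charset name to an Encoding,
    // defaulting to UTF-8 when the name is missing or unknown.
    public static Encoding Resolve(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;

        try
        {
            return Encoding.GetEncoding(charset.Trim());
        }
        catch (ArgumentException)
        {
            // Unknown or misspelled charset name.
            return Encoding.UTF8;
        }
    }
}
```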

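As L.B mentioned in his comment, it is possible to download the bytes and examine them to determine the encoding. That can be done with surprising accuracy, but it requires a lot of code.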
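
A full detector is beyond the scope of an answer, but a much-reduced sketch of the idea covers the case in the question (a page declared as iso-8859-1 that is really UTF-8). It checks for a byte order mark, then tests whether the bytes are strictly valid UTF-8, and only otherwise trusts the declared encoding. The names `EncodingSniffer` and `Detect` are invented for this example, and it deliberately ignores UTF-32 and any statistical analysis.

```csharp
using System;
using System.Text;

static class EncodingSniffer
{
    // Hypothetical helper: guesses the encoding of downloaded bytes.
    // 'declared' is whatever the header or META tag claimed (may be null).
    public static Encoding Detect(byte[] data, Encoding declared)
    {
        // 1. A byte order mark is the strongest hint.
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 little endian
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 big endian

        // 2. Text in a single-byte encoding that contains non-ASCII
        //    characters is very unlikely to be valid UTF-8 by accident,
        //    so a strict decode is a useful test.
        var strictUtf8 = new UTF8Encoding(false, true); // throw on invalid bytes
        try
        {
            strictUtf8.GetCharCount(data);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            // 3. Not valid UTF-8: fall back to the declared encoding.
            return declared ?? Encoding.UTF8;
        }
    }
}
```

To use it you would download the raw bytes first (for example with `WebClient.DownloadData`), call `Detect`, and then decode with `GetString` on the result, instead of wrapping the response stream in a `StreamReader` up front. Pure ASCII pages also pass the UTF-8 test, which is harmless because ASCII decodes identically either way.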

Encoding when getting a page from the web
