Wrong encoding when reading HTML source code

Asked: 2018-04-18 04:37:32

Tags: c# html unicode encoding utf-8

I fetch the HTML source of this URL: "http://duhoc.dantri.com.vn/du-hoc/30-hoc-sinh-trung-tuyen-dai-hoc-my-nam-2018-chia-se-bi-kip-thanh-cong-20180418093640358.htm" with the following method:

    private static string getPageSource(string url)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "SO/1.0";
            // Dispose the response, stream, and reader even if reading throws.
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    // The response.CharacterSet check is commented out,
                    // so the body is always decoded as UTF-8.
                    //if (response.CharacterSet == null)
                    using (Stream receiveStream = response.GetResponseStream())
                    using (StreamReader readStream = new StreamReader(receiveStream, Encoding.UTF8))
                    {
                        return readStream.ReadToEnd();
                    }
                }
            }
        }
        catch (Exception ex)
        {
            WriteLog("Exception get Page Source, Ex = " + ex.ToString());
        }
        // Reached on a non-OK status or after a logged exception.
        return null;
    }
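
The commented-out `response.CharacterSet` check hints at picking the decoder from the charset the server declares instead of hard-coding UTF-8. A minimal sketch of that idea follows, assuming a hypothetical helper `ResolveEncoding` in a hypothetical class `CharsetHelper` (neither is part of the code above); it falls back to UTF-8 when the charset is missing or not recognized:

```csharp
using System;
using System.Text;

static class CharsetHelper
{
    // Hypothetical helper: maps a charset name (for example the value of
    // HttpWebResponse.CharacterSet) to an Encoding, defaulting to UTF-8.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }
        try
        {
            // Some servers quote the charset value, so strip quotes first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: fall back to UTF-8.
            return Encoding.UTF8;
        }
    }
}
```

It could be passed to the reader as `new StreamReader(receiveStream, CharsetHelper.ResolveEncoding(response.CharacterSet))`. On .NET Core and later, legacy code pages are only available after registering `CodePagesEncodingProvider`, so without that registration this sketch would fall back to UTF-8 for them.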

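The browser shows the page title as: 30 học sinh trúng tuyển đại học Mỹ năm 2018 chia sẻ "bí kíp" thành công. But when I get the HTML source of that page by calling the method above, the title becomes: 30 học sinh trúng tuyển đại học Mỹ năm 2018 chia sẻ "bí kíp" thành c&#244;ng. To fix this, I changed UTF8 to: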

    Encoding encode = System.Text.Encoding.GetEncoding(1255);

as well as UTF7 and UTF32, but nothing worked. So what am I doing wrong?
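
One thing worth checking: `&#244;` in the fetched title is an HTML numeric character reference (code point 244, the letter ô), not a mis-decoded byte sequence. That would suggest the bytes are already read correctly and the text only lacks entity decoding, which a browser does when it renders the page. A minimal sketch under that assumption, using `System.Net.WebUtility.HtmlDecode`; the class and method names `TitleDecoding.DecodeEntities` are hypothetical and not from the code above:

```csharp
using System;
using System.Net;

static class TitleDecoding
{
    // Hypothetical helper: turns character references such as &#244; or
    // &quot; back into the characters they stand for.
    public static string DecodeEntities(string html)
    {
        return WebUtility.HtmlDecode(html);
    }
}
```

For example, `TitleDecoding.DecodeEntities("th&#224;nh c&#244;ng")` returns `thành công`. In practice this would be applied to the extracted title text rather than to the whole page source, since decoding the entire document would also unescape markup that is meant to stay escaped.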

0 answers:

No answers