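Reading non-English HTML pages with C#

Date: 2010-06-09 17:59:02

Tags: html unicode hebrew

I am trying to find a Hebrew string on a website. The code that reads the page is attached below.

After dumping the page to a file, I read the file with a StreamReader, but I cannot match strings in languages other than English. What should I do?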

    // requires: using System.IO; using System.Net;

    // buffer reused on each read operation
    byte[] buf = new byte[8192];

    // prepare the web page we will be asking for
    HttpWebRequest request = (HttpWebRequest)
        WebRequest.Create("http://www.webPage.co.il");

    // execute the request
    HttpWebResponse response = (HttpWebResponse)
        request.GetResponse();

    // we will read data via the response stream
    Stream resStream = response.GetResponseStream();

    int count = 0;
    FileStream fileDump = new FileStream(@"c:\dump.txt", FileMode.Create);
    do
    {
        count = resStream.Read(buf, 0, buf.Length);
        // write only the bytes actually read, not the whole buffer
        fileDump.Write(buf, 0, count);
    }
    while (count > 0); // any more data to read?

    fileDump.Close();
    response.Close();

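2 answers:

Answer 0: (score: 0)

You are not specifying an appropriate encoding when decoding the stream. See the documentation for the WebResponse.GetResponseStream method for details.

Update: use the Hebrew (Windows) encoding, code page 1255.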

Encoding encode = System.Text.Encoding.GetEncoding(1255); // Hebrew (Windows) 

// Pipe the stream to a higher level stream reader with the required encoding format. 
 StreamReader readStream = new StreamReader( resStream , encode );
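
To illustrate why the match fails with the wrong decoder, here is a minimal sketch that runs without a network connection. The class name `EncodingDemo`, its helper methods, and the sample word are hypothetical and not part of the original code. It also assumes a modern .NET runtime (.NET Core 3.0 or later), where code page 1255 is only available after registering `CodePagesEncodingProvider`; on the classic .NET Framework that registration line should not be needed.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    static EncodingDemo()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // such as 1255 must be registered before Encoding.GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Hebrew (Windows) code page, as suggested in the answer above.
    public static Encoding Hebrew
    {
        get { return Encoding.GetEncoding(1255); }
    }

    // Builds the raw bytes of a tiny page the way a windows-1255 server would send them.
    public static byte[] SamplePage(string word)
    {
        return Hebrew.GetBytes("<p>" + word + "</p>");
    }

    // Decodes the raw bytes with the given encoding and looks for the search string.
    public static bool ContainsText(byte[] raw, Encoding encoding, string needle)
    {
        using (var reader = new StreamReader(new MemoryStream(raw), encoding))
        {
            return reader.ReadToEnd().Contains(needle);
        }
    }

    static void Main()
    {
        // "shalom" written with escapes so the source file encoding does not matter
        string word = "\u05E9\u05DC\u05D5\u05DD";
        byte[] page = SamplePage(word);

        Console.WriteLine(ContainsText(page, Hebrew, word));        // True
        Console.WriteLine(ContainsText(page, Encoding.UTF8, word)); // False
    }
}
```

The same bytes decode to the expected Hebrew text with code page 1255, while decoding them as UTF-8 yields replacement characters, so the search string is never found.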

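Answer 1: (score: 0)

Solved it.

The problem was that I had chosen the wrong encoding. I had picked UTF-8, which is not always the right answer :)

The key lines: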

Encoding encode = System.Text.Encoding.GetEncoding("windows-1255");
StreamReader readStream = new StreamReader(ReceiveStream, encode);
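
Since UTF-8 is not always right, one possible refinement is to prefer the charset the server reports and fall back to windows-1255 only when that value is missing or unknown. The sketch below is an assumption-laden illustration, not the method used in this thread: `CharsetHelper`, `ResolveEncoding`, and `DownloadText` are hypothetical names. It relies on the `HttpWebResponse.CharacterSet` property, and it assumes the page's HTTP header is accurate, which is not guaranteed for every site.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class CharsetHelper
{
    // Returns the encoding named by charset, or fallback if the name is
    // empty or not recognised by this runtime.
    public static Encoding ResolveEncoding(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }
        try
        {
            // Some servers quote the charset value, so strip quotes first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return fallback;
        }
    }

    // Downloads a page and decodes it with the reported charset when possible.
    public static string DownloadText(string url, Encoding fallback)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(
            response.GetResponseStream(),
            ResolveEncoding(response.CharacterSet, fallback)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

A call might look like `CharsetHelper.DownloadText("http://www.webPage.co.il", Encoding.GetEncoding("windows-1255"))`. On .NET Core or later, `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)` would have to run first for the windows-1255 lookup to succeed.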