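I'm using .NET's WebRequest to "screen scrape" my own page as a temporary hack.
This works well, but accented characters and characters with diacritics don't come through correctly.
I'm wondering whether there is a way to get them to translate correctly using .NET's many built-in properties and methods.
Here is the code I use to grab the page: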
private string getArticle(string urlToGet)
{
StreamReader oSR = null;
//Here's the work horse of what we're doing, the WebRequest object
//fetches the URL
WebRequest objRequest = WebRequest.Create(urlToGet);
//The WebResponse object gets the Request's response (the HTML)
WebResponse objResponse = objRequest.GetResponse();
//Now dump the contents of our HTML in the Response object to a
//Stream reader
oSR = new StreamReader(objResponse.GetResponseStream());
//And dump the StreamReader into a string...
string strContent = oSR.ReadToEnd();
//Here we set up our Regular expression to snatch what's between the
//BEGIN and END
Regex regex = new Regex("<!-- content_starts_here //-->((.|\n)*?)<!-- content_ends_here //-->",
RegexOptions.IgnoreCase);
//Here we apply our regular expression to our string using the
//Match object.
Match oM = regex.Match(strContent);
//Bam! We return the value from our Match, and we're in business.
return oM.Value;
}
Answer 0 (score: 2)
Try using:
System.Net.WebClient client = new System.Net.WebClient();
string html = client.DownloadString(urlToGet);
string decoding = System.Web.HttpUtility.HtmlDecode(html);
Also, take a look at client.Encoding, which sets the encoding WebClient uses when it downloads strings.
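As a minimal sketch of the client.Encoding suggestion, the helper below sets the encoding before calling DownloadString. The class and method names (Downloader, Download) and the fallback parameter are hypothetical, and this assumes the server does not declare a different charset in its Content-Type header, in which case WebClient may prefer the declared one.

```csharp
using System.Net;
using System.Text;

static class Downloader
{
    // Hypothetical helper: downloads a page as text, decoding it with the
    // supplied encoding (for example Encoding.GetEncoding("ISO-8859-1")).
    public static string Download(string urlToGet, Encoding fallback)
    {
        using (var client = new WebClient())
        {
            // Encoding must be set before the download call to take effect.
            client.Encoding = fallback;
            return client.DownloadString(urlToGet);
        }
    }
}
```

A call might look like Downloader.Download(urlToGet, Encoding.GetEncoding("ISO-8859-1")), with the encoding chosen to match how the page is actually served.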
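Answer 1 (score: 0)
There is another way to handle it: pass an encoding as the second argument to the StreamReader constructor. Without that argument, StreamReader decodes the stream as UTF-8. For example: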
new StreamReader(webRequest.GetResponse().GetResponseStream(),
Encoding.GetEncoding("ISO-8859-1"));
That should do it, provided the page is actually served as ISO-8859-1.
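Rather than hard-coding ISO-8859-1, one option is to take the charset from the response itself. The sketch below assumes the response is an HttpWebResponse and reads its CharacterSet property. PageFetcher, ResolveEncoding, and Fetch are hypothetical names, and the UTF-8 fallback is an assumption, not something the answers above prescribe. Note that CharacterSet may report a default value when the server sends no charset, so the result is only as reliable as the server's headers.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class PageFetcher
{
    // Hypothetical helper: maps a charset name to an Encoding, falling back
    // to UTF-8 (an assumption) when the name is missing or not recognised.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
            return Encoding.UTF8;
        try
        {
            // Some servers quote the charset value, so strip quotes first.
            return Encoding.GetEncoding(charset.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Fetches the page and decodes it with the encoding the server declared.
    public static string Fetch(string urlToGet)
    {
        WebRequest request = WebRequest.Create(urlToGet);
        using (WebResponse response = request.GetResponse())
        {
            var httpResponse = response as HttpWebResponse;
            Encoding encoding = ResolveEncoding(
                httpResponse != null ? httpResponse.CharacterSet : null);

            using (Stream stream = response.GetResponseStream())
            using (var reader = new StreamReader(stream, encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

In getArticle, the string returned by PageFetcher.Fetch(urlToGet) could stand in for strContent before the regular expression is applied.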