So I'm trying to read the HTML source of a page that contains Czech characters (ř, ť, š, ň, etc.). The page's charset is windows-1250 (Content-Type: text/html; charset=windows-1250).
var hc = new Windows.Web.Http.HttpClient();
var uri = new Windows.Foundation.Uri("http://rozvrhuni.hys.cz/150909.html");
hc.defaultRequestHeaders.acceptLanguage.parseAdd("cs");
hc.defaultRequestHeaders.acceptEncoding.parseAdd("windows-1250");
hc.getStringAsync(uri).done(
    function complete(result) {
        htmlText = result;
    },
    function error(result) {
        (new Windows.UI.Popups.MessageDialog("Non-existent content", "Error")).showAsync().done();
        return;
    }
);
My code gets the source, but some characters still come out wrong (ř = ø, č = è, etc.).
What do I do to read the page correctly?
Answer 0 (score: 0)
I'm not familiar with JavaScript, but I think the concept is the same as in C#.
The following code is in C#, but I hope it helps you.
string retVal = "";
byte[] bodybytes = {0};
// Calling RegisterProvider once per process is enough.
var provider = System.Text.CodePagesEncodingProvider.Instance;
System.Text.Encoding.RegisterProvider(provider);
var enc = System.Text.Encoding.GetEncoding("windows-1250");
...
// Read the raw response bytes instead of an already-decoded string.
bodybytes = await response.Content.ReadAsByteArrayAsync();
...
// Decode the bytes explicitly as windows-1250.
retVal = enc.GetString(bodybytes, 0, bodybytes.Length);
Note: you may need to add the following NuGet package to your project. https://www.nuget.org/packages/System.Text.Encoding.CodePages/
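Since the question is in JavaScript, here is a rough sketch of the same idea there: take the raw bytes and decode them with an explicit charset instead of relying on the default string conversion. It assumes the runtime provides the standard TextDecoder with support for legacy single-byte encodings, which may not hold in every app host. The names decodeBytes and sample are made up for illustration. In the WinRT app the bytes would presumably come from getBufferAsync rather than getStringAsync, but that part is not shown here.

```javascript
// Sketch only: decode raw bytes with an explicit charset label.
// Assumption: the runtime has the WHATWG TextDecoder with legacy
// encodings available (e.g. a modern browser, or Node.js with full ICU).
function decodeBytes(bytes, label) {
  // Default (non-fatal) mode replaces undecodable bytes with U+FFFD.
  return new TextDecoder(label).decode(bytes);
}

// Hypothetical sample: bytes 0xF8 and 0xE8 are "ř" and "č" in windows-1250.
const sample = new Uint8Array([0xF8, 0xE8]);

// Decoding with the page's real charset gives the Czech letters.
const correct = decodeBytes(sample, "windows-1250");

// Decoding the same bytes as windows-1252 reproduces the garbling
// described in the question (ø and è).
const wrong = decodeBytes(sample, "windows-1252");

console.log(correct);
console.log(wrong);
```

The same two bytes give different characters depending only on the label passed to the decoder, which is why the garbled output in the question points at a wrong charset rather than a damaged download.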