Question

So I'm trying to read the HTML source of a page that contains Czech characters (ř, ť, š, ň, etc.). The charset of the page is windows-1250 (Content-Type: text/html; charset=windows-1250).

    var hc = new Windows.Web.Http.HttpClient();
    var uri = new Windows.Foundation.Uri("http://rozvrhuni.hys.cz/150909.html");
    hc.defaultRequestHeaders.acceptLanguage.parseAdd("cs");
    hc.defaultRequestHeaders.acceptEncoding.parseAdd("windows-1250");
    hc.getStringAsync(uri).done(
        function complete(result) {
            htmlText = result;
        },
        function error(result) {
            (new Windows.UI.Popups.MessageDialog("Non-existent content", "Error")).showAsync().done();
            return;
        }
    );

My code gets the source, but some characters still come out wrong (ř becomes ø, č becomes è, etc.).

What do I do to read the page correctly?

Answer 1

I'm not familiar with JavaScript, but I think the concept is the same as in C#.

The following code is in C#, but I hope it helps you.

    string retVal = "";
    byte[] bodybytes = {0};

    // Registering the provider once per process is enough.
    // It makes code-page encodings such as windows-1250 available.
    var provider = System.Text.CodePagesEncodingProvider.Instance;
    System.Text.Encoding.RegisterProvider(provider);

    var enc = System.Text.Encoding.GetEncoding("windows-1250");
    ...
    // Read the raw bytes so that no default charset is applied to the body.
    bodybytes = await response.Content.ReadAsByteArrayAsync();
    ...
    // Decode the bytes explicitly as windows-1250.
    retVal = enc.GetString(bodybytes, 0, bodybytes.Length);

Note: you may need to add the following NuGet package to your project: https://www.nuget.org/packages/System.Text.Encoding.CodePages/
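
For what it's worth, the same idea (take the raw bytes and decode them with an explicit charset) might look roughly like this in JavaScript. This is only a sketch: `decodeBytes` and `sample` are names I made up for illustration, and it assumes the runtime provides a `TextDecoder` that knows the `windows-1250` label, which I have not verified for your app host. The two sample bytes also show why ř turns into ø and č into è: the same bytes are being decoded with a Western European code page instead of windows-1250.

```javascript
// Hypothetical helper: decode raw response bytes with an explicit charset
// instead of relying on whatever default the HTTP client picks.
function decodeBytes(bytes, charset) {
    // TextDecoder throws a RangeError if the runtime does not know the label.
    return new TextDecoder(charset).decode(Uint8Array.from(bytes));
}

// 0xF8 and 0xE8 are the windows-1250 byte values for "ř" and "č".
var sample = [0xF8, 0xE8];

// Decoded with the page's real charset.
var correct = decodeBytes(sample, "windows-1250");

// Decoded with a Western European code page, which reproduces the
// garbled characters described in the question.
var garbled = decodeBytes(sample, "windows-1252");

console.log(correct, garbled);
```

In the app, the bytes would presumably come from something like `hc.getBufferAsync(uri)` followed by `Windows.Security.Cryptography.CryptographicBuffer.copyToByteArray(buffer)` rather than from `getStringAsync`, but I have not tested that part.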
