Downloading a page's HTML returns “ \u0003�T���0”

Date: 2019-02-10 20:11:02

Tags: c# html web-scraping

I am trying to get the HTML of this page:

https://ec.europa.eu/esco/portal/skill?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Fskill%2F00735755-adc6-4ea0-b034-b8caff339c9f&conceptLanguage=en&full=true

But for some reason, the output I receive looks like this:

\0\0\0\0\0\0\u0003�T���0\u0010�#�\u000f�\aNM�.+�b�\"v�\u0010�\u0015+��\u001b����[�\u000e���\u001e�\v���

Here is the code:

using (WebClient client = new WebClient())
{
    client.Headers.Add("Host", "ec.europa.eu");
    client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0");
    client.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    client.Headers.Add("Accept-Language", "pl,en-US;q=0.7,en;q=0.3");
    client.Headers.Add("Accept-Encoding", "gzip, deflate, br");
    client.Headers.Add("DNT", "1");
    client.Headers.Add("Cookie", "JSESSIONID=-(...); escoLanguage=en");

    var output = client.DownloadString(new Uri("https://ec.europa.eu/esco/portal/skill?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Fskill%2F00735755-adc6-4ea0-b034-b8caff339c9f&conceptLanguage=en&full=true"));
}

Does anyone know what is causing this?

I also tried HTML Agility Pack:

var url = urls.First();
var web = new HtmlWeb();
var doc = web.Load(url);

But doc.Text is null.

3 answers:

Answer 0: (score: 2)

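The "Accept-Encoding: gzip" header tells the server that it may send you the raw data gzip-compressed. WebClient does not decompress the response for you, so you have to decompress the output stream manually. For example, in a Linux shell: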

curl -H "Accept-Encoding: gzip" "$url" --output - | gzip -d

A better solution is to remove this header, so that the server sends the page uncompressed.
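If you would rather keep compression on the wire, another option (not part of the answer above, just a sketch) is to let the framework decompress the response. The class name DecompressingWebClient and the helper EnableDecompression are made up for this example; HttpWebRequest.AutomaticDecompression is the real property doing the work. With it set, do not add "Accept-Encoding" yourself: the request advertises gzip and deflate on its own, and Brotli ("br") is deliberately left out because this setting does not cover it.

```csharp
using System;
using System.Net;

// Hypothetical helper class: a WebClient whose requests decompress
// gzip/deflate responses automatically.
public class DecompressingWebClient : WebClient
{
    // Turns on automatic decompression when the request is an HTTP request.
    public static void EnableDecompression(WebRequest request)
    {
        var http = request as HttpWebRequest;
        if (http != null)
        {
            http.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
    }

    // WebClient calls this for every download; we adjust the request it builds.
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        EnableDecompression(request);
        return request;
    }
}
```

Used in place of WebClient in the question's code (minus the Accept-Encoding line), DownloadString should then return readable HTML, assuming the server answers with gzip or deflate.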

Answer 1: (score: 1)

// Requires: using System; using System.IO; using System.IO.Compression; using System.Net; using System.Text;
using (WebClient client = new WebClient())
{
    client.Encoding = Encoding.UTF8;
    client.Headers.Add("Host", "ec.europa.eu");
    client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0");
    client.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    client.Headers.Add("Accept-Language", "pl,en-US;q=0.7,en;q=0.3");
    // Advertise gzip only: GZipStream cannot decode a deflate or Brotli ("br") response.
    client.Headers.Add("Accept-Encoding", "gzip");
    client.Headers.Add("DNT", "1");
    client.Headers.Add("Cookie", "JSESSIONID=-(...); escoLanguage=en");

    // DownloadData returns the raw response bytes, which are still gzip-compressed.
    byte[] compressed = client.DownloadData(new Uri("https://ec.europa.eu/esco/portal/skill?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Fskill%2F00735755-adc6-4ea0-b034-b8caff339c9f&conceptLanguage=en&full=true"));

    string output;
    using (var gzip = new GZipStream(new MemoryStream(compressed), CompressionMode.Decompress))
    using (var decompressed = new MemoryStream())
    {
        // Inflate the response into memory, then decode it as UTF-8 text.
        gzip.CopyTo(decompressed);
        output = Encoding.UTF8.GetString(decompressed.ToArray());
    }
}

The output looks like that because it is compressed, so decompress it with gzip as shown above.
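To see the effect without touching the network, here is a small round-trip sketch. GzipDemo and its method names are invented for this illustration, and the compression step only stands in for what a server does when it honours "Accept-Encoding: gzip". A gzip stream always begins with the bytes 0x1F 0x8B, which is a quick way to check whether a payload still needs decompressing before you decode it as text.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical demo class, not part of any library.
public static class GzipDemo
{
    // Gzip-compresses UTF-8 text, standing in for the server's response body.
    public static byte[] Compress(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // Disposing the GZipStream flushes it; ToArray still works afterwards.
            return buffer.ToArray();
        }
    }

    // True when the data starts with the gzip magic bytes 0x1F 0x8B.
    public static bool LooksLikeGzip(byte[] data)
    {
        return data != null && data.Length >= 2 && data[0] == 0x1f && data[1] == 0x8b;
    }

    // Inflates gzip data and decodes the result as UTF-8 text.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            gzip.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}
```

Passing the compressed bytes straight to Encoding.UTF8.GetString gives the same kind of unreadable text as in the question, while GzipDemo.Decompress returns the original string.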

Answer 2: (score: 0)

Removing {{1}} is the solution for WebClient.