Two ways to load HTML from a URL?

Asked: 2018-08-06 18:23:43

Tags: html-agility-pack

To load HTML from a URL, I have been using the following method:

public HtmlDocument DownloadSource(string url)
{
    try
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(DownloadString(url));
        return doc;
    }
    catch (Exception e)
    {
        if (Task.Error == null)
            Task.Error = e;
        Task.Status = TaskStatuses.Error;
        Done = true;
        return null;
    }
}

But today the code above suddenly stopped working. I found another approach that works fine:

HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url.ToString());

Now I just want to know the difference between the two methods.

1 answer:

Answer 0 (score: 1)

It looks like a User-Agent header is now required by your site.

HtmlAgilityPack itself is fine, but you should change your DownloadString(url) method. If you inspect the request with Fiddler, you will see that the server responds with 403 Forbidden.
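The 403 can also be confirmed from code, without Fiddler. The following is only a minimal sketch: `StatusCheck` and `GetStatus` are hypothetical names that do not appear in the question. It relies on `WebClient.DownloadString` throwing a `WebException` whose `Response` holds the server's reply when the request fails at the HTTP level.

```csharp
using System;
using System.Net;

static class StatusCheck
{
    // Hypothetical helper: returns the HTTP status of a GET request made with
    // WebClient, or null when no HTTP response was received at all
    // (DNS failure, refused connection, and so on).
    public static HttpStatusCode? GetStatus(string url)
    {
        using (var client = new WebClient())
        {
            try
            {
                client.DownloadString(url);
                // DownloadString completed without throwing; reported as OK
                // here for simplicity.
                return HttpStatusCode.OK;
            }
            catch (WebException ex)
            {
                // For protocol errors such as 403, Response carries the reply.
                var response = ex.Response as HttpWebResponse;
                return response?.StatusCode;
            }
        }
    }
}
```

Calling `StatusCheck.GetStatus(url)` against the failing page should report `Forbidden` if the diagnosis above is correct.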


The solution is to add a User-Agent header (any value will do) to the request:

using HtmlAgilityPack;
using System;
using System.Net;

class Program
{
    static void Main()
    {
        var doc = DownloadSource("https://videohive.net/item/inspired-slideshow/21544630");
        Console.ReadKey();
    }

    public static HtmlDocument DownloadSource(string url)
    {
        try
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(DownloadString(url));
            return doc;
        }
        catch (Exception e)
        {
            // exception handling here
        }
        return null;
    }

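    static string DownloadString(string url)
    {
        // Dispose the WebClient once the download has finished.
        using (WebClient client = new WebClient())
        {
            // Without a User-Agent header the site answers 403 Forbidden.
            client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:x.x.x) Gecko/20041107 Firefox/x.x");
            return client.DownloadString(url);
        }
    }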
}
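
As for the difference between the two approaches: as far as I can tell, HtmlWeb performs the HTTP request itself and sends a browser-like User-Agent by default (exposed through its UserAgent property), whereas a bare WebClient sends none. That would explain why web.Load(url) kept working. Below is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced. `HtmlWebExample`, `CreateWeb` and `Load` are hypothetical helpers, and any agent string passed in is a placeholder rather than a value the site is known to require.

```csharp
using System;
using HtmlAgilityPack;

static class HtmlWebExample
{
    // Hypothetical helper: creates an HtmlWeb that sends the given User-Agent.
    public static HtmlWeb CreateWeb(string userAgent)
    {
        return new HtmlWeb { UserAgent = userAgent };
    }

    // Hypothetical helper: loads a page through HtmlWeb and prints the
    // HTTP status code that HtmlWeb recorded for the request.
    public static HtmlDocument Load(string url, string userAgent)
    {
        HtmlWeb web = CreateWeb(userAgent);
        HtmlDocument doc = web.Load(url);
        Console.WriteLine("HTTP status: " + web.StatusCode);
        return doc;
    }
}
```

With this, either approach can be made to send the same header, so the choice between WebClient plus LoadHtml and HtmlWeb.Load comes down to who issues the request.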