IronWebScraper: need to log in first before scraping

Time: 2018-07-29 09:33:25

Tags: c# web-scraping session-cookies setcookie

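I want to scrape a page that requires authentication first (https://www.cdc.co.nz/products/list.html?cat=5201), logging in through this link: https://www.cdc.co.nz/login/. After going through several SO threads, I arrived at the code below.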

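So far, the code below lets me log in successfully. However, I suspect that I am not transferring all of the cookie information to IronWebScraper, so it does not have enough "authentication information" to scrape the desired page, because I receive this error message: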

ProductScraperFactory Critical, Http: URL permanently failed after 8 attempts: https://www.cdc.co.nz/products/list.html?cat=5201

    var cookieContainer = new CookieContainer();
    HttpClientHandler handler;

    using (handler = new HttpClientHandler()
    {
        CookieContainer = cookieContainer
    })

    //Let's login first
    using (HttpClient client = new HttpClient(handler))
    {
        string urlToPost = "https://www.cdc.co.nz/login/";

        HttpContent stringContent = new StringContent("username=USERNAME&password=hunter2");

        HttpResponseMessage response = null;

        //Yes this works - getting a 200 status code
        Task.Run(async () => response = await client.PostAsync(urlToPost, stringContent)).GetAwaiter().GetResult();

        var headerValues = response.Headers.ToList();

        HttpIdentity identity = new HttpIdentity {UseCookies = true};
        foreach (var headerKV in headerValues)
        {
            identity.HttpRequestHeaders.Add(headerKV.Key, headerKV.Value.ToArray()[0]);
        }

        Uri uri = new Uri("https://www.cdc.co.nz/login/");

        var cookieValue = headerValues.Where(c => c.Key == "Set-Cookie").Select(c => c).ToArray()[0].Value.ToArray()[0];

        identity.Cookies.SetCookies(uri, cookieValue);

        identity.Cookies.Add(cookieContainer.GetCookies(uri));

        identity.UserAgent =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36";

        Request request = new Request {Identity = identity};

        //Let's see if we can see any Duracell
        var scraper = new ProductScraperFactory("https://www.cdc.co.nz/products/list.html?cat=5201", ScrapingOperation.CDC, request);
        scraper.Start();

        Console.ReadLine();
    }
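
The suspicion that not every cookie reaches the scraper can be narrowed down with plain System.Net calls before IronWebScraper is involved. A minimal sketch, assuming hypothetical cookie names (`session`, `loginonly`) and `Path` attributes in place of whatever the server actually sends: `CookieContainer.GetCookies` only returns cookies whose domain and path match the URI it is given, so a cookie scoped to `/login/` would not be returned for the product list URL. `CookieCheck` and `CookieNamesFor` are illustrative names, not part of any library.

```csharp
using System;
using System.Linq;
using System.Net;

public static class CookieCheck
{
    // Returns the names of the cookies the container would send to the given URL.
    public static string[] CookieNamesFor(CookieContainer container, Uri target)
    {
        return container.GetCookies(target)
                        .Cast<Cookie>()
                        .Select(c => c.Name)
                        .ToArray();
    }

    public static void Main()
    {
        var container = new CookieContainer();
        var loginUri = new Uri("https://www.cdc.co.nz/login/");
        var productUri = new Uri("https://www.cdc.co.nz/products/list.html?cat=5201");

        // Hypothetical Set-Cookie values; the real ones come from the login response.
        container.SetCookies(loginUri, "session=abc123; Path=/");
        container.SetCookies(loginUri, "loginonly=1; Path=/login/");

        // Compare what is visible from the login URL with what is visible from the product URL.
        Console.WriteLine("Login page:   " + string.Join(", ", CookieNamesFor(container, loginUri)));
        Console.WriteLine("Product page: " + string.Join(", ", CookieNamesFor(container, productUri)));
    }
}
```

Running the same check against the real `cookieContainer` after the login POST, with the product URL instead of the login URL, would show which cookies are actually available for that page.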

My ProductScraperFactory, which extends IronWebScraper's WebScraper class:

    private Request _request;

    public ProductScraperFactory(string URLToScrap, ScrapingOperation operation, Request request)
    {
        _scrapingOperation = operation;
        _urlToScrap = URLToScrap;

        ChooseIdentityForRequest(request);
        _request = request;
    }

    public override void Init()
    {
        Request(_urlToScrap, Parse, _request.Identity);
    }

    public override void Parse(Response response)
    { ...}

Documentation for ChooseIdentityForRequest: https://ironsoftware.com/csharp/webscraper/help/html/2498dbf0-8d85-70bf-4d82-a748be9a3a51.htm

Documentation for Request: https://ironsoftware.com/csharp/webscraper/help/html/f28b6dc8-939c-dd94-3534-bada24edc1fa.htm

0 answers:

No answers