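I want to scrape a page that requires authentication first (https://www.cdc.co.nz/products/list.html?cat=5201), logging in via https://www.cdc.co.nz/login/. After going through several SO links, I arrived at the code below.
So far, the following code lets me log in successfully. However, I suspect I am not transferring all of the cookie information to IronWebScraper, so it does not know enough "authentication info" to scrape the desired page, because I get this error message:
ProductScraperFactory Critical, Http: URL permanently failed after 8 tries: https://www.cdc.co.nz/products/list.html?cat=5201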
    var cookieContainer = new CookieContainer();
    HttpClientHandler handler;
    using (handler = new HttpClientHandler()
    {
        CookieContainer = cookieContainer
    })
    //Let's login first
    using (HttpClient client = new HttpClient(handler))
    {
        string urlToPost = "https://www.cdc.co.nz/login/";
        HttpContent stringContent = new StringContent("username=USERNAME&password=hunter2");
        HttpResponseMessage response = null;
        //Yes this works - getting a 200 status code
        Task.Run(async () => response = await client.PostAsync(urlToPost, stringContent)).GetAwaiter().GetResult();
        var headerValues = response.Headers.ToList();
        HttpIdentity identity = new HttpIdentity {UseCookies = true};
        foreach (var headerKV in headerValues)
        {
            identity.HttpRequestHeaders.Add(headerKV.Key, headerKV.Value.ToArray()[0]);
        }
        Uri uri = new Uri("https://www.cdc.co.nz/login/");
        var cookieValue = headerValues.Where(c => c.Key == "Set-Cookie").Select(c => c).ToArray()[0].Value.ToArray()[0];
        identity.Cookies.SetCookies(uri, cookieValue);
        identity.Cookies.Add(cookieContainer.GetCookies(uri));
        identity.UserAgent =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36";
        Request request = new Request {Identity = identity};
        //Let's see if we can see any Duracell
        var scraper = new ProductScraperFactory("https://www.cdc.co.nz/products/list.html?cat=5201", ScrapingOperation.CDC, request);
        scraper.Start();
        Console.ReadLine();
    }
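To narrow down where the authentication information might be getting lost, here is a minimal diagnostic sketch that runs without any network access. It is not part of my scraper and does not touch IronWebScraper at all. The class name `LoginDiagnostics`, its two helpers, and the cookie names `session` and `loginOnly` are hypothetical and only there for illustration. It checks two things: which Content-Type a `StringContent` body carries compared with a `FormUrlEncodedContent` body (a 200 status alone may not prove that the server parsed the login fields), and which cookies a `CookieContainer` would actually send to the product URL, since cookies are matched by domain and path.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;

public static class LoginDiagnostics
{
    // Builds a form-encoded body. Unlike StringContent(string), this sets
    // Content-Type to application/x-www-form-urlencoded.
    // The field names are copied from the question; the real form may differ.
    public static FormUrlEncodedContent BuildLoginForm(string username, string password)
    {
        return new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["username"] = username,
            ["password"] = password
        });
    }

    // Returns "name=value" for every cookie the container would send to the target URL.
    public static List<string> CookiesSentTo(CookieContainer container, Uri target)
    {
        return container.GetCookies(target)
            .Cast<Cookie>()
            .Select(c => c.Name + "=" + c.Value)
            .ToList();
    }

    public static void Main()
    {
        // Compare the Content-Type headers of the two ways to build the body.
        var form = BuildLoginForm("USERNAME", "hunter2");
        var plain = new StringContent("username=USERNAME&password=hunter2");
        Console.WriteLine("Form body:   " + form.Headers.ContentType);
        Console.WriteLine("String body: " + plain.Headers.ContentType);

        // Hypothetical cookies standing in for whatever the login response sets.
        var jar = new CookieContainer();
        jar.Add(new Cookie("session", "abc", "/", "www.cdc.co.nz"));
        jar.Add(new Cookie("loginOnly", "xyz", "/login", "www.cdc.co.nz"));

        // Only cookies whose path covers the product page are returned here.
        var target = new Uri("https://www.cdc.co.nz/products/list.html?cat=5201");
        foreach (var cookie in CookiesSentTo(jar, target))
        {
            Console.WriteLine("Would be sent: " + cookie);
        }
    }
}
```

Calling `CookiesSentTo(cookieContainer, new Uri("https://www.cdc.co.nz/products/list.html?cat=5201"))` right after the login POST in my code above should show whether a session cookie was stored at all, before anything is handed over to the scraper.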
My ProductScraperFactory, which extends IronWebScraper's WebScraper class:
    private Request _request;
    public ProductScraperFactory(string URLToScrap, ScrapingOperation operation, Request request)
    {
        _scrapingOperation = operation;
        _urlToScrap = URLToScrap;
        ChooseIdentityForRequest(request);
        _request = request;
    }
    public override void Init()
    {
        Request(_urlToScrap, Parse, _request.Identity);
    }
    public override void Parse(Response response)
    { ...}
Documentation for ChooseIdentityForRequest: https://ironsoftware.com/csharp/webscraper/help/html/2498dbf0-8d85-70bf-4d82-a748be9a3a51.htm
Documentation for Request: https://ironsoftware.com/csharp/webscraper/help/html/f28b6dc8-939c-dd94-3534-bada24edc1fa.htm