Detecting and saving cookies / localStorage when web scraping in C#

Time: 2019-09-19 19:31:55

Tags: c# cookies web-scraping

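I'm scraping some data, but the site validates my requests by checking cookies. I get one "free" page, after which it expects some localStorage entries or cookies to be set. Typically 3-5 cookies are set when the page loads.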

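I'm not sure how to do this or how to save the cookies. I have seen how to add cookies in code before compiling, but that only works if I browse the site with Firefox / Chrome and copy the cookie data from there. The site's JavaScript is obfuscated, so I can't just parse it with regular expressions. I'm using ScrapySharp, HtmlAgilityPack, and plain old HttpClient interchangeably to try to save the cookies.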

I call the GetContent method and need to save the cookie / localStorage information somewhere so that it can be reused on the next call.

using System;
using System.Net;
using System.Net.Http;

// "Scraper" is only a wrapper name so the snippet compiles on its own.
public static class Scraper
{
    // Shared across calls, so cookies received via Set-Cookie headers stay in memory.
    public static CookieContainer cookieContainer = new CookieContainer();

    // One client for all calls; creating a new HttpClient per request wastes sockets.
    private static readonly HttpClient httpClient = new HttpClient(new HttpClientHandler
    {
        AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
        CookieContainer = cookieContainer
    });

    public static string GetContent(string url, string referrer = "https://www.google.com")
    {
        // Request the url argument (the original used CurrentOffice.url and ignored it).
        var httpRequestMessage = new HttpRequestMessage(HttpMethod.Get, url);
        httpRequestMessage.Headers.Add("Connection", "keep-alive");
        httpRequestMessage.Headers.Add("Pragma", "no-cache");
        httpRequestMessage.Headers.Add("Cache-Control", "no-cache");
        httpRequestMessage.Headers.Add("Upgrade-Insecure-Requests", "1");
        httpRequestMessage.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
        httpRequestMessage.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
        httpRequestMessage.Headers.Add("Referer", referrer);
        httpRequestMessage.Headers.Add("Accept-Encoding", "gzip, deflate");
        httpRequestMessage.Headers.Add("Accept-Language", "en-GB,en-US;q=0.9,en;q=0.8");

        // If I can detect them, I can add cookies here like this
        // (baseAddress was undefined in the original, so the request URL is used instead).
        cookieContainer.Add(new Uri(url), new Cookie("CookieName", "cookie_value"));

        // Block until the response arrives, since this method is synchronous.
        var httpResponseMessage = httpClient.SendAsync(httpRequestMessage).GetAwaiter().GetResult();
        string result = httpResponseMessage.Content.ReadAsStringAsync().GetAwaiter().GetResult();
        return result;
    }
}
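
To see which cookies the server itself sets, one option is to list what the CookieContainer has captured after a request. The sketch below assumes the cookies arrive in Set-Cookie response headers; cookies or localStorage entries written by the page's JavaScript will not show up, because HttpClient never runs scripts. The names CookieInspector, Describe, and FetchAndDescribeAsync are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class CookieInspector
{
    // Returns one descriptive line per cookie the container would send to the given URI.
    public static List<string> Describe(CookieContainer container, Uri uri)
    {
        var lines = new List<string>();
        foreach (Cookie c in container.GetCookies(uri))
        {
            lines.Add($"{c.Name}={c.Value}; domain={c.Domain}; path={c.Path}");
        }
        return lines;
    }

    // Fetches the page once with a fresh container and reports what was captured.
    public static async Task<List<string>> FetchAndDescribeAsync(string url)
    {
        var container = new CookieContainer();
        using (var handler = new HttpClientHandler { CookieContainer = container })
        using (var client = new HttpClient(handler))
        using (await client.GetAsync(url))
        {
            return Describe(container, new Uri(url));
        }
    }
}
```

If the list comes back empty while the browser shows 3-5 cookies for the same page, the cookies are most likely being created by script rather than by response headers.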

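I'm hoping someone else has solved this problem and has a suggestion. Alternatively, maybe I could try to stop the JavaScript from running and redirecting to a 405 page.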
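
For the "save it somewhere for the next call" part, a minimal sketch is to write the container's cookies for the site to a JSON file and load them back on startup. It assumes System.Text.Json is available (.NET Core 3.0 or later); SavedCookie, CookieStore, and the file path are hypothetical names, and only cookies matching the given site URI are saved. It does not cover localStorage, which exists only inside a browser.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.Json;

// Plain data holder so the cookie fields can be serialized.
public class SavedCookie
{
    public string Name { get; set; }
    public string Value { get; set; }
    public string Domain { get; set; }
    public string Path { get; set; }
    public DateTime Expires { get; set; }
    public bool Secure { get; set; }
    public bool HttpOnly { get; set; }
}

public static class CookieStore
{
    // Writes every cookie the container holds for the site to a JSON file.
    public static void Save(CookieContainer container, Uri site, string file)
    {
        var list = new List<SavedCookie>();
        foreach (Cookie c in container.GetCookies(site))
        {
            list.Add(new SavedCookie
            {
                Name = c.Name, Value = c.Value, Domain = c.Domain, Path = c.Path,
                Expires = c.Expires, Secure = c.Secure, HttpOnly = c.HttpOnly
            });
        }
        File.WriteAllText(file, JsonSerializer.Serialize(list));
    }

    // Rebuilds a container from the file; returns an empty one if the file is missing.
    public static CookieContainer Load(string file)
    {
        var container = new CookieContainer();
        if (!File.Exists(file)) return container;

        var list = JsonSerializer.Deserialize<List<SavedCookie>>(File.ReadAllText(file))
                   ?? new List<SavedCookie>();
        foreach (var s in list)
        {
            var cookie = new Cookie(s.Name, s.Value, s.Path, s.Domain)
            {
                Secure = s.Secure,
                HttpOnly = s.HttpOnly
            };
            // DateTime.MinValue marks a session cookie, so leave Expires unset in that case.
            if (s.Expires != DateTime.MinValue) cookie.Expires = s.Expires;
            container.Add(cookie);
        }
        return container;
    }
}
```

With something like this, the static cookieContainer could be initialized from CookieStore.Load at startup and saved again after each GetContent call.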

0 answers:

No answers