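I'm scraping some data, but the site validates my requests by checking cookies. I get one "free" page, and after that it expects me to have set some localStorage values or cookies. Typically 3-5 cookies are set on page load.
I'm not sure how to do this or how to save the cookies. I've seen how to hard-code cookies in the source before compiling, but that only works if I browse the site with Firefox/Chrome first and copy the cookie data from there. The site's JavaScript is obfuscated, so I can't simply parse it with a regex. I'm using ScrapySharp, HtmlAgilityPack, and plain old HttpClient interchangeably to try to save the cookies.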
I call the GetContent method and need to save the cookie/localStorage information somewhere so that it can be reused on the next call.
    // Requires: using System; using System.Net; using System.Net.Http;
    public static CookieContainer cookieContainer = new CookieContainer();

    public static string GetContent(string url, string referrer = "https://www.google.com")
    {
        // The static container is shared across calls, so cookies sent via Set-Cookie persist.
        HttpClientHandler newhandler = new HttpClientHandler()
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate,
            CookieContainer = cookieContainer
        };
        var httpClient = new HttpClient(newhandler);
        // Use the url parameter rather than an outside field
        var httpRequestMessage = new HttpRequestMessage(HttpMethod.Get, url);
        httpRequestMessage.Headers.Add("Connection", "keep-alive");
        httpRequestMessage.Headers.Add("Pragma", "no-cache");
        httpRequestMessage.Headers.Add("Cache-Control", "no-cache");
        httpRequestMessage.Headers.Add("Upgrade-Insecure-Requests", "1");
        httpRequestMessage.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36");
        httpRequestMessage.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8");
        httpRequestMessage.Headers.Add("Referer", referrer);
        httpRequestMessage.Headers.Add("Accept-Encoding", "gzip, deflate");
        httpRequestMessage.Headers.Add("Accept-Language", "en-GB,en-US;q=0.9,en;q=0.8");
        // If I can detect them, I can add cookies here like this
        cookieContainer.Add(new Uri(url), new Cookie("CookieName", "cookie_value"));
        // Blocking on .Result keeps the method synchronous, as in my original code
        var httpResponseMessage = httpClient.SendAsync(httpRequestMessage).Result;
        string result = httpResponseMessage.Content.ReadAsStringAsync().Result;
        return result;
    }
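To check what the shared container actually ends up holding after a call, something like the following might help. This is only a rough sketch: `CookieInspector` and `DescribeCookies` are hypothetical names I made up, and it can only show cookies the server sent via `Set-Cookie` headers, not cookies or localStorage values written by the page's JavaScript.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class CookieInspector
{
    // Hypothetical helper: returns "name=value" for every cookie the
    // container would send to the given URL.
    public static List<string> DescribeCookies(CookieContainer container, string url)
    {
        var lines = new List<string>();
        foreach (Cookie cookie in container.GetCookies(new Uri(url)))
        {
            lines.Add($"{cookie.Name}={cookie.Value}");
        }
        return lines;
    }
}
```

Calling `CookieInspector.DescribeCookies(cookieContainer, url)` right after `GetContent(url)` should show whether any of the 3-5 cookies arrive through response headers at all. If the list stays empty, the cookies are presumably being set by script, which plain HttpClient does not execute.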
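I'm hoping someone has already solved this problem and has suggestions. Alternatively, maybe I could try to stop the JavaScript from running and redirecting to the 405 page.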