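I am scraping a dynamic website with a Scrapy spider that uses the Selenium Chrome web driver. Recently I noticed that the site has started blocking my spider. Even when I run the code only for testing and debugging, it downloads just one or two pages before being blocked. The page source it prints looks like this: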
<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta name="ROBOTS" content="NOINDEX, NOFOLLOW" />
<meta http-equiv="cache-control" content="max-age=0" />
<meta http-equiv="cache-control" content="no-cache" />
<meta http-equiv="expires" content="0" />
<meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="refresh" content="10; url=/distil_r_captcha.html?requestId=73bf85fe-a3e7-4dc8-9284-1605c4cd82f3&httpReferrer=%2Fmyytavat-uudisasunnot%3FcardType%3D100%26locations%3D%255B%2522helsinki%2522%255D%26newDevelopment%3D1%26buildingType%255B%255D%3D1%26buildingType%255B%255D%3D256%26pagination%3D1" />
<script type="text/javascript">
(function(window){
try {
if (typeof sessionStorage !== 'undefined'){
sessionStorage.setItem('distil_referrer', document.referrer);
}
} catch (e){}
})(window);
</script>
<script type="text/javascript" src="/dstlsnm.js" defer=""></script><style type="text/css">#d__fFH{position:absolute;top:-5000px;left:-5000px}#d__fF{font-family:serif;font-size:200px;}#dfdretxfwc{display:none!important}</style></head>
<body>
<div id="distilIdentificationBlock"> </div>
<div id="d__fFH" style="position: absolute; top: -5000px; left: -5000px;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: Courier, serif; font-size: 72px; ">The quick brown fox jumps over the lazy dog.</span></div></body></html>
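While my spider is blocked, I can still browse the site normally in a regular browser, which is weird. The spider stays blocked for about fifteen minutes, and then it is able to download page source again.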
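To notice when this happens during a run, a minimal sketch like the one below could check the page source for the markers visible in the output above (`distilIdentificationBlock` and the `distil_r_captcha` redirect). The function names `is_block_page` and `captcha_request_id` are hypothetical, and it is only an assumption that these two markers cover every block page the site can serve.

```python
import re

# Markers copied from the block page shown above. Assumption: other
# variants of the block page, if any, also contain at least one of them.
BLOCK_MARKERS = ("distilIdentificationBlock", "distil_r_captcha")

# Matches the requestId parameter in the meta-refresh redirect URL.
REQUEST_ID_RE = re.compile(r"distil_r_captcha\.html\?requestId=([0-9a-fA-F-]+)")


def is_block_page(page_source: str) -> bool:
    """Return True if the page source looks like the block page above."""
    return any(marker in page_source for marker in BLOCK_MARKERS)


def captcha_request_id(page_source: str):
    """Return the requestId from the captcha redirect, or None if absent."""
    match = REQUEST_ID_RE.search(page_source)
    return match.group(1) if match else None
```

In the spider, this could be called as `is_block_page(self.driver.page_source)` right after each page load, so that a blocked response is logged and skipped instead of being parsed as a normal page.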
What I tried was adding a user agent, as follows:
chromeOptions.add_argument('--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
self.driver = webdriver.Chrome(executable_path=chrome_driver, chrome_options=chromeOptions)
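But it does not seem to solve the problem (and I still do not understand the point of it: since Selenium uses Chrome to fetch the page source, a 'user-agent' parameter should already be part of its default settings). Any suggestions on how the web server identifies my spider even though I am not downloading a large number of pages?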