Selenium scraper detected when loading the home page

Time: 2019-05-14 00:11:54

Tags: python selenium web-scraping

I am trying to scrape this website: https://www.zocdoc.com/

I first tried the requests library and got the following response from the site:

b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=20&xinfo=13-8874904-0%200NNN%20RT%281557792003687%20128%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%284%2c200%2c0%29%20U5&incident_id=787000970007113277-35368596172637725&edet=15&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 787000970007113277-35368596172637725</iframe></body></html>'

So I switched to Selenium, which usually works. I tested it with the following simple code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://www.zocdoc.com/"
driver.get(url)

But this did not work either. I got this result:

[Image: screenshot of the resulting page]

How does the site detect the bot so quickly?

1 Answer:

Answer 0: (score: 1)

As the posted image shows, the site is protected by the Imperva WAF (Web Application Firewall) or a related product.
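The requests response quoted in the question points the same way: its body embeds an `/_Incapsula_Resource` iframe and an "Incapsula incident ID" message, and Incapsula is an Imperva product. A minimal sketch of how a scraper could recognise such a block page follows. The function name `looks_like_incapsula_block` and the marker list are illustrative assumptions, not part of any library, and the sample body is a shortened version of the response above.

```python
# Markers copied from the response body shown in the question. Treating them
# as a signature of an Incapsula block page is an assumption of this sketch.
INCAPSULA_MARKERS = (b"_Incapsula_Resource", b"Incapsula incident ID")


def looks_like_incapsula_block(body: bytes) -> bool:
    """Return True if the raw response body contains any Incapsula marker."""
    return any(marker in body for marker in INCAPSULA_MARKERS)


# Shortened version of the blocked response from the question.
blocked_body = (
    b'<html><body><iframe src="/_Incapsula_Resource?CWUDNSAI=20">'
    b"Request unsuccessful. Incapsula incident ID: "
    b"787000970007113277-35368596172637725</iframe></body></html>"
)
# Hypothetical body of a page that was served normally.
normal_body = b"<html><body>ok</body></html>"

print(looks_like_incapsula_block(blocked_body))
print(looks_like_incapsula_block(normal_body))
```

With the requests library, the check would be applied to `response.content`, which holds the body as bytes.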

If you ping the site, you will see that all requests go through an Imperva-related address.

ping www.zocdoc.com
Pinging ux639.x.incapdns.net [45.60.62.232] with 32 bytes of data:
Reply from 45.60.62.232: bytes=32 time=46ms TTL=59
Reply from 45.60.62.232: bytes=32 time=47ms TTL=59
Reply from 45.60.62.232: bytes=32 time=46ms TTL=59
Reply from 45.60.62.232: bytes=32 time=46ms TTL=59

As you can see, www.zocdoc.com resolves to a host in the incapdns.net namespace, which according to WHOIS is owned by Imperva.
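The same check can be scripted instead of read off the ping output. The sketch below uses only the standard library. The helper names `is_incapsula_host` and `canonical_name` are hypothetical, and the `.incapdns.net` suffix is taken from the ping output above, so other Imperva-owned domains would not be caught.

```python
import socket

# Suffix taken from the ping output above; an assumption, not a complete list.
INCAPSULA_DNS_SUFFIX = ".incapdns.net"


def is_incapsula_host(name: str) -> bool:
    """Return True if a DNS name falls under the incapdns.net namespace."""
    return name.rstrip(".").lower().endswith(INCAPSULA_DNS_SUFFIX)


def canonical_name(hostname: str) -> str:
    """Resolve hostname and return the primary name the resolver reports.

    What the resolver reports as the primary name varies by platform, so on
    some systems this may simply echo the hostname that was passed in.
    """
    name, _aliases, _addresses = socket.gethostbyname_ex(hostname)
    return name


# Name copied from the ping output, so no network access is needed here.
print(is_incapsula_host("ux639.x.incapdns.net"))
print(is_incapsula_host("www.zocdoc.com"))
```

Against the live site, the call would be `is_incapsula_host(canonical_name("www.zocdoc.com"))`. That needs network access, and the result depends on the site's DNS setup at the time.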

As for how the detection works, I think the following post already covers it: Can a website detect when you are using selenium with chromedriver?