Question

I am using PhantomJS within the Python Selenium framework to scrape a pool of URLs.

First, I specify the user agent.

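from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Copy the default PhantomJS capabilities and override the user agent.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87")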

Next, I create the PhantomJS instance.

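from selenium import webdriver

def create_phantomJS():
    # Start PhantomJS with the capabilities (including the user agent) in dcap.
    driver = webdriver.PhantomJS("phantomjs.exe", desired_capabilities=dcap)
    return driver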

Then I grab all the visible text from the website (quick and dirty):

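from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def use_driver(driver, URL):
    website = driver.get(URL) # this line is needed, the highlighting is a bug in
    # Wait up to 1 second for the root <html> element, then take its visible text.
    html = WebDriverWait(driver, 1).until(EC.presence_of_element_located((By.XPATH, ".//html")))
    text = (str(html.text.encode("utf-8",'ignore')))
    return text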

With the URL set to "https://www.whoishostingthis.com/tools/user-agent/", the output contains the following:

Your User Agent is:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87

This confirms that the setting works. However, the website I am interested in returns:

Access Denied
You don't have permission to access "URL" on this server.
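To tell the two outcomes above apart across a whole pool of URLs, the text returned by use_driver can be checked programmatically. The following is only a minimal sketch: the helper name summarize_response is hypothetical, and it assumes the two pages keep exactly the wording shown in the outputs above.

```python
def summarize_response(text):
    """Classify page text returned by use_driver().

    Assumes the marker strings "Access Denied" and "Your User Agent is:"
    appear exactly as in the sample outputs; other sites will differ.
    """
    lines = [line.strip() for line in text.splitlines()]

    # A block page starts with an "Access Denied" line in the sample output.
    blocked = any(line.startswith("Access Denied") for line in lines)

    # The user-agent test page prints the agent on the line after the label.
    user_agent = None
    for i, line in enumerate(lines):
        if line == "Your User Agent is:" and i + 1 < len(lines):
            user_agent = lines[i + 1]
            break

    return {"blocked": blocked, "user_agent": user_agent}
```

Calling summarize_response(use_driver(driver, URL)) for each URL would then show which pages were served and which were refused.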

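In contrast, this works with geckodriver, which I do not understand. Moreover, using BeautifulSoup or requests returns only the robots.txt content, which suggests that something on the DOM side is able to detect PhantomJS specifically.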
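One way I could start comparing the two driver instances is to dump a few JavaScript-visible properties from each and diff them. This is a sketch under assumptions, not a confirmed list of what the site checks: the helper names are hypothetical, and the chosen properties (window.callPhantom and window._phantom, which PhantomJS exposes to pages, plus some navigator fields) are only guesses at what a detection script might look at. Differences in the HTTP request itself would not show up here.

```python
# Assumption: these properties are illustrative guesses, not the site's actual checks.
FINGERPRINT_JS = """
return {
    userAgent: navigator.userAgent,
    hasCallPhantom: typeof window.callPhantom !== 'undefined',
    hasPhantom: typeof window._phantom !== 'undefined',
    pluginCount: navigator.plugins ? navigator.plugins.length : 0,
    languages: navigator.languages
        ? Array.prototype.slice.call(navigator.languages) : []
};
"""


def collect_fingerprint(driver):
    """Run the snippet in the currently loaded page and return a dict.

    `driver` is any Selenium WebDriver; only execute_script() is used.
    """
    return driver.execute_script(FINGERPRINT_JS)


def diff_fingerprints(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

In use, I would load the same page in both drivers, call collect_fingerprint on each, and pass the two results to diff_fingerprints to see what differs despite the identical user agent.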

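Q1: How is PhantomJS detected despite the different user agent?

Q2: What does geckodriver do differently that lets it pass this check?

Q3: Is it theoretically possible to circumvent this check?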

Please note that I am interested in understanding the difference between the two driver instances, not in doing anything shady.

How can a website detect and block PhantomJS when it uses a Mozilla user-agent string?

0 answers: