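PhantomJS returns an empty web page (Python, Selenium)

Asked: 2015-04-05 23:54:32

Tags: python selenium selenium-webdriver phantomjs

I'm trying to screen-scrape a website without having to launch an actual browser instance from a Python script (using Selenium). I can do this with Chrome or Firefox (I've tried it and it works), but I want to use PhantomJS so that it runs headless.

The code looks like this: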

import sys
import traceback
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87"
)

try:
    # Choose our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser = webdriver.PhantomJS()
    #browser = webdriver.Firefox()
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    browser.get("https://www.whatever.com")

    # For debug, see what we got back
    html_source = browser.page_source
    with open('out.html', 'w') as f:
        f.write(html_source)

    # PROCESS THE PAGE (code removed)

except Exception:
    browser.save_screenshot('screenshot.png')
    traceback.print_exc(file=sys.stdout)

finally:
    browser.close()

The output is just:

<html><head></head><body></body></html>

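But when I use the Chrome or Firefox options instead, it works fine. I thought maybe the website was returning junk based on the user agent, so I tried spoofing it. No difference.

What am I missing?

UPDATE: I will try to keep the snippet below updated until it works. Here is what I am currently trying.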
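When debugging this, it can help to detect the empty shell shown above programmatically instead of opening out.html by hand. The following is a minimal sketch using only the standard library; is_blank_page is a hypothetical helper name, not part of Selenium, and the regular expression is a rough check rather than a full HTML parser.

```python
import re

# Hypothetical helper (not part of Selenium): reports whether an HTML
# document has an empty <body>, like the PhantomJS output in the question.
_BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def is_blank_page(html_source):
    """Return True if the page has no <body> or only whitespace inside it."""
    match = _BODY_RE.search(html_source)
    if match is None:
        return True
    return match.group(1).strip() == ""


print(is_blank_page("<html><head></head><body></body></html>"))  # prints True
print(is_blank_page("<html><body><p>hello</p></body></html>"))   # prints False
```

In the script above, this could be called on browser.page_source right after browser.get() to fail fast when PhantomJS returns nothing.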

import sys
import traceback
import time
import re

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87")

try:
    # Set up our browser
    browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])
    #browser = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")

    # Go to the login page
    print("getting web page...")
    browser.get("https://www.website.com")

    # Need to wait for the page to load
    timeout = 10
    print("waiting %s seconds..." % timeout)
    wait = WebDriverWait(browser, timeout)
    element = wait.until(EC.element_to_be_clickable((By.ID,'the_id')))
    print("done waiting. Response:")

    # Rest of code snipped. It fails at the wait above.

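3 answers:

Answer 0 (score: 29)

I ran into the same problem, and no amount of code to make the driver wait helped. The problem is the SSL encryption on https websites; ignoring the SSL errors solves it.

Create the PhantomJS driver like this: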

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

This solved my problem.
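The two flags in this answer are plain command-line options that Selenium forwards to the PhantomJS process through service_args. One plausible explanation, not confirmed in this thread, is that the handshake fails when PhantomJS and the server cannot agree on an SSL/TLS protocol version, which leaves the page empty. The sketch below only builds the flag list so it can be toggled while experimenting; phantomjs_service_args is a hypothetical helper name, and the flag values are the ones quoted in the answer.

```python
def phantomjs_service_args(ignore_ssl_errors=True, ssl_protocol="TLSv1"):
    """Build the command-line flags passed to PhantomJS via service_args."""
    args = []
    if ignore_ssl_errors:
        args.append("--ignore-ssl-errors=true")
    if ssl_protocol:
        args.append("--ssl-protocol=" + ssl_protocol)
    return args


print(phantomjs_service_args())
# prints ['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1']

# Assuming PhantomJS and a Selenium release that still includes the
# PhantomJS driver are installed, the list would be used like this:
#   driver = webdriver.PhantomJS(service_args=phantomjs_service_args())
```

Keeping the flags in one place makes it easy to try each one on its own and see which of them actually fixes the blank page for a given site.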

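Answer 1 (score: 3)

You need to wait for the page to load. Usually this is done with an Explicit Wait, which waits for a key element to be present or visible on the page. For example: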

from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


# ...
browser.get("https://www.whatever.com")

wait = WebDriverWait(browser, 10)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.content")))

html_source = browser.page_source
# ...

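Here we wait up to 10 seconds for the div element with class="content" to become visible before getting the page source.

Additionally, you may need to ignore SSL errors: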
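To make the idea behind an explicit wait concrete: the condition is polled repeatedly until it returns a truthy value or the timeout expires. The sketch below is a simplified stand-in written for Python 3, not Selenium's actual WebDriverWait implementation; wait_until, WaitTimeout, and page_ready are hypothetical names, and the "page" is simulated with a counter so the example runs without a browser.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become truthy in time."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)


# Simulated page that becomes "ready" on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(page_ready, timeout=2.0, poll_interval=0.01))  # prints True
```

This also shows why a wait alone cannot fix the question's problem if the page never loads at all: the condition simply stays false until the timeout is raised.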

browser = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true'])

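That said, I am fairly sure this is related to the redirect issues in PhantomJS. There is an open ticket about it in the phantomjs bugtracker.

Answer 2 (score: 0)

driver = webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true', '--ssl-protocol=TLSv1'])

This worked for me.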