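How to get the HTTP status code using selenium.py (Python code)

Asked: 2011-04-27 04:08:50

Tags: python selenium

I am writing a Selenium script in Python, but I don't think I have seen any information about:

How do I get the HTTP status code from Selenium Python code?

Or maybe I am missing something. If anyone has found a way, please feel free to post it.

12 answers: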

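Answer 0 (score: 24)

It's not possible.

Unfortunately, Selenium does not provide this information, by design. There has been a very lengthy discussion about this, but the short of it is:

  1. Selenium is a browser emulation tool, not necessarily a testing tool.
  2. Selenium performs many GETs and POSTs while rendering a page, and adding an interface for them would complicate the API in ways the authors resist.

We are left with hacks like these:

    1. Look for error information in the returned HTML.
    2. Use another tool instead, such as Requests (but see the shortcomings of that approach in @Zeinab's answer).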

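Answer 1 (score: 8)

I don't have much experience with Python. I have a more detailed Java example here:

https://stackoverflow.com/a/39979509/5703420

The idea is to enable performance logging, which triggers "Network.enable" on chromedriver. Then fetch the performance log entries and look through them for "Network.responseReceived" messages.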

    from selenium import webdriver

    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities    
    # enable browser logging
    d = DesiredCapabilities.CHROME
    d['loggingPrefs'] = { 'performance':'ALL' }

    driver = webdriver.Chrome(executable_path="c:\\windows\\chromedriver.exe", service_args=["--verbose", "--log-path=D:\\temp3\\chromedriverxx.log"], desired_capabilities=d)

    driver.get('https://api.ipify.org/?format=text')

    print(driver.title)

    print(driver.page_source)

    # get_log() drains the log buffer, so fetch once and reuse the result
    performance_log = driver.get_log('performance')
    print(str(performance_log).strip('[]'))

    for entry in performance_log:
        print(entry)

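The output will contain a "Network.responseReceived" message for your URL, along with messages for other requests made while loading the page or following redirects. All you have to do is parse the log entries.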

'{"message":{"method":"Network.responseReceived","params":{"frameId":"9488.1","loaderId":"9488.1","requestId":"9488.1","response":{"connectionId":14,"connectionReused":false,"encodedDataLength":-1,"fromDiskCache":false,"fromServiceWorker":false,"headers":{"Connection":"keep-alive","Content-Length":"13","Content-Type":"text/plain","Date":"Wed, 12 Oct 2016 06:15:47 GMT","Server":"Cowboy","Via":"1.1 vegur"},"headersText":"HTTP/1.1 200 OK\\r\\nServer: Cowboy\\r\\nConnection: keep-alive\\r\\nContent-Type: text/plain\\r\\nDate: Wed, 12 Oct 2016 06:15:47 GMT\\r\\nContent-Length:13\\r\\nVia:1.1vegur\\r\\n\\r\\n","mimeType":"text/plain","protocol":"http/1.1","remoteIPAddress":"54.197.246.207","remotePort":443,"requestHeaders":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8","Accept-Encoding":"gzip, deflate, sdch, br","Accept-Language":"en-GB,en-US;q=0.8,en;q=0.6","Connection":"keep-alive","Host":"api.ipify.org","Upgrade-Insecure-Requests":"1","User-Agent":"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"},"requestHeadersText":"GET /?format=text HTTP/1.1\\r\\nHost: api.ipify.org\\r\\nConnection: keep-alive\\r\\nUpgrade-Insecure-Requests: 1\\r\\nUser-Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36\\r\\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\\r\\nAccept-Encoding: gzip, deflate, sdch, br\\r\\nAccept-Language: en-GB,en-US;q=0.8,en;q=0.6\\r\\n\\r\\n","securityDetails":{"certificateId":1,"certificateValidationDetails":{"numInvalidScts":0,"numUnknownScts":0,"numValidScts":0},"cipher":"AES_128_GCM","keyExchange":"ECDHE_RSA","protocol":"TLS 
1.2","signedCertificateTimestampList":[]},"securityState":"secure","status":200,"statusText":"OK","timing":{"connectEnd":320.508999997401,"connectStart":3.08100000256673,"dnsEnd":3.08100000256673,"dnsStart":0,"proxyEnd":-1,"proxyStart":-1,"pushEnd":0,"pushStart":0,"receiveHeadersEnd":465.725000001839,"requestTime":78246.775045,"sendEnd":320.995999994921,"sendStart":320.825999995577,"sslEnd":320.435000001453,"sslStart":141.675999999279,"workerReady":-1,"workerStart":-1},"url":"https://api.ipify.org/?format=text"},"timestamp":78247.242716,"type":"Document"}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948094, 'level': 'INFO', 'message': '{"message":{"method":"Network.dataReceived","params":{"dataLength":13,"encodedDataLength":171,"requestId":"9488.1","timestamp":78247.243137}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948094, 'level': 'INFO', 'message': '{"message":{"method":"Page.frameNavigated","params":{"frame":{"id":"9488.1","loaderId":"9488.1","mimeType":"text/plain","securityOrigin":"https://api.ipify.org","url":"https://api.ipify.org/?format=text"}}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948095, 'level': 'INFO', 'message': '{"message":{"method":"Network.loadingFinished","params":{"encodedDataLength":171,"requestId":"9488.1","timestamp":78247.242066}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948115, 'level': 'INFO', 'message': '{"message":{"method":"Page.loadEventFired","params":{"timestamp":78247.264169}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948115, 'level': 'INFO', 'message': '{"message":{"method":"Page.frameStoppedLoading","params":{"frameId":"9488.1"}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 147625298116, 'level': 'INFO', 'message': 
'{"message":{"method":"Page.domContentEventFired","params":{"timestamp":78247.276475}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}, {'timestamp': 1476252948122, 'level': 'INFO', 'message': '{"message":{"method":"Network.requestWillBeSent","params":{"documentURL":"https://api.ipify.org/?format=text","frameId":"9488.1","initiator":{"type":"other"},"loaderId":"9488.1","request":{"headers":{"Referer":"https://api.ipify.org/?format=text","User-Agent":"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36"},"initialPriority":"High","method":"GET","mixedContentType":"none","url":"https://api.ipify.org/favicon.ico"},"requestId":"9488.2","timestamp":78247.280131,"type":"Other","wallTime":1476252948.11805}},"webview":"6e8a3b1d-e5aa-40fb-a695-280cbb0ee420"}'}

Then read "status":200 from the JSON response. You can also parse the response "headers".
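The parsing step can be sketched roughly as follows. This is a minimal sketch, assuming each log entry is a dict whose 'message' value is a JSON string shaped like the output above; the helper name status_for_url is hypothetical and not part of Selenium.

```python
import json


def status_for_url(entries, url):
    """Return the status of the last Network.responseReceived event for url, or None."""
    status = None
    for entry in entries:
        # Each performance log entry carries a JSON string under 'message'.
        message = json.loads(entry['message'])['message']
        if message.get('method') != 'Network.responseReceived':
            continue
        response = message['params']['response']
        # Exact URL match; a redirected request is reported under its final URL.
        if response.get('url') == url:
            status = response.get('status')
    return status
```

It could be called as status_for_url(driver.get_log('performance'), 'https://api.ipify.org/?format=text'). Because the match is exact, a page that redirects would need the final URL (for example driver.current_url) rather than the one requested.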

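Answer 2 (score: 7)

I have been searching the web for about 3 hours and found no way to do this with WebDriver. I haven't worked with Selenium directly. The only suggestion that comes to mind is to use the "requests" module, as follows:

import requests
from selenium import webdriver

url = "https://example.com"  # placeholder URL

driver = webdriver.Chrome()  # any browser driver works here
driver.get(url)
r = requests.get(url)  # a second, independent request
print(r.status_code)

A full tutorial on Requests is available online, and you can install the module with the command pip install requests.

There is one problem, which may not always occur but which you should be aware of: the response the driver receives and the response Requests receives are not the same response. You only get the status code of the Requests call, so if the URL responds inconsistently, the result may be wrong.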

Answer 3 (score: 4)

It seems the response status code can be obtained from the logs through the API.

from selenium import webdriver
import json
browser = webdriver.PhantomJS()
browser.get('http://www.google.fr')
har = json.loads(browser.get_log('har')[0]['message'])
# status code and reason phrase of the first HAR entry
print(har['log']['entries'][0]['response']['status'])
print(har['log']['entries'][0]['response']['statusText'])

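Answer 4 (score: 2)

I will refer you to a question asked earlier: How to detect when Selenium loads a browser's error page

Beyond that, unless you want to set up something like a Squid proxy or a browser proxy, you have to go with a dirty solution like the following.

Replace

driver.get( "http://google.com" )

with

def goTo( url ):
    # Load the page first, then look for Firefox's error page container.
    driver.get( url )
    ids = [ elem.get_attribute("id") for elem in driver.find_elements_by_css_selector("body > div") ]
    if "errorPageContainer" in ids:
        raise Exception( "this page is an error" )

You can get creative and derive the error code from the text the browser actually displays. This has to be tailored to each browser; the version above works for Firefox.

The only case where this becomes a problem is 404 (page not found), because many sites serve their own error pages and you would have to customize the check for each site.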

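Answer 5 (score: 2)

To get the status code of a URL using Selenium, you can use JavaScript and the XMLHttpRequest object. The WebDriver class has a method that you can call to execute JavaScript code in the browser:

execute_async_script()

See the documentation for more information about the execute_async_script method.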
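One way this could look is sketched below. It is a minimal sketch, not the answer's own code: the names XHR_STATUS_SCRIPT and fetch_status are hypothetical, and it assumes a page on the same origin as the target URL is already loaded, since XMLHttpRequest is subject to the same-origin policy.

```python
# JavaScript run inside the browser. Selenium appends the completion callback
# as the last argument, after any arguments passed from Python.
XHR_STATUS_SCRIPT = """
var url = arguments[0];
var done = arguments[arguments.length - 1];
var xhr = new XMLHttpRequest();
xhr.open('GET', url, true);
xhr.onload = function () { done(xhr.status); };
xhr.onerror = function () { done(null); };
xhr.send();
"""


def fetch_status(driver, url):
    """Issue an XMLHttpRequest from the current page and return its HTTP status."""
    return driver.execute_async_script(XHR_STATUS_SCRIPT, url)
```

Note that this issues a second request from inside the browser, so it reports the status of that request rather than of the original navigation. A None result here means the request failed at the network level or was blocked. Slow URLs may need a longer script timeout, set with driver.set_script_timeout().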

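Answer 6 (score: 1)

I am using Java here because I don't have much Python experience. Also, I don't know how to get only the HTTP status code. The following gives you the entire network traffic, from which you can extract the status code.

First, start the server:

selenium.start("captureNetworkTraffic=true");

Then capture your traffic with:

String traffic = selenium.captureNetworkTraffic("xml");

You can also get the output as JSON.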

Answer 7 (score: 1)

Corey Goldberg has a nice profiler implementation that uses Selenium and Python and outputs formatted results. Here is the link:

http://coreygoldberg.blogspot.com/2009/10/automated-webhttp-profiler-with.html

Answer 8 (score: 1)

import json
from selenium.webdriver.chrome.webdriver import WebDriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chromedriver_path = "YOUR/PATH/TO/chromedriver.exe"
url = "https://selenium-python.readthedocs.io/api.html"
capabilities = DesiredCapabilities.CHROME.copy()
capabilities['goog:loggingPrefs'] = {'performance': 'ALL'}

browser = WebDriver(chromedriver_path, desired_capabilities=capabilities)

browser.get(url)
logs = browser.get_log('performance')

Option 1: return a single status code, on the assumption that the page whose status you want is the log entry whose content type contains 'text/html'

def get_status(logs):
    for log in logs:
        if log['message']:
            d = json.loads(log['message'])
            try:
                content_type = 'text/html' in d['message']['params']['response']['headers']['content-type']
                response_received = d['message']['method'] == 'Network.responseReceived'
                if content_type and response_received:
                    return d['message']['params']['response']['status']
            except KeyError:
                # entry has no response or no 'content-type' header; skip it
                pass

Usage:

>>> get_status(logs)
200

Option 2: if you want to see all the status codes in the relevant log entries

def get_status_codes(logs):
    statuses = []
    for log in logs:
        if log['message']:
            d = json.loads(log['message'])
            if d['message'].get('method') == "Network.responseReceived":
                statuses.append(d['message']['params']['response']['status'])
    return statuses

Usage:

>>> get_status_codes(logs)
[200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200]

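Note 1: Much of this is based on @Stefan Matei's answer. However, a few things have changed between Chrome versions, and I have added an idea of how to parse the logs.

Note 2: ['content-type'] is not entirely reliable. The letter case of the header name may vary. Check it for your use case.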
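To work around the letter-case issue, the header could be looked up without regard to case. The following is a minimal sketch; get_header is a hypothetical helper, not part of Selenium.

```python
def get_header(headers, name, default=None):
    """Look up a header by name, ignoring letter case."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default
```

In Option 1 the content-type check could then be written as 'text/html' in get_header(d['message']['params']['response']['headers'], 'content-type', ''), which also avoids a KeyError when the header is absent.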

Answer 9 (score: 0)

You can also check the last message in the log for an error status code.
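The answer does not say which log it means. As one possible reading, and purely as an assumption, the sketch below looks at Chrome's browser console log, where a failed load is reported with text such as "Failed to load resource: the server responded with a status of 404 (Not Found)". The helper name is hypothetical.

```python
import re

# Assumed message wording from Chrome's console log; other browsers differ.
_STATUS_RE = re.compile(r'status of (\d{3})')


def error_status_from_last_entry(entries):
    """Return the status code mentioned in the last log entry, or None."""
    if not entries:
        return None
    match = _STATUS_RE.search(entries[-1].get('message', ''))
    return int(match.group(1)) if match else None
```

It might be called as error_status_from_last_entry(driver.get_log('browser')). A None result only means the last entry did not mention a status, not that the page loaded successfully.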

Answer 10 (score: 0)

You can get the status code from the page title.

For example, here is nginx's 403 Forbidden response:

<html>
    <head>
        <title>403 Forbidden</title>
    </head>
    <body></body>
</html>

Selenium code:

# <title> is not a visible element, so element.text would be empty; use driver.title
text = driver.title
if '403 Forbidden' in text:
    print('[INFO] status code is 403')

Of course, this solution does not cover all cases.
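The same idea can be generalized a little. This is a minimal sketch, assuming the server's error pages put the code at the start of the title the way the nginx page above does; status_from_title is a hypothetical helper.

```python
import re


def status_from_title(title):
    """Return a leading three-digit HTTP-style code from a page title, or None."""
    match = re.match(r'\s*([1-5]\d{2})\b', title or '')
    return int(match.group(1)) if match else None
```

It could be used as status_from_title(driver.title). An ordinary page whose title happens to start with a three-digit number would be misread as an error, so treat the result as a hint only.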

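Answer 11 (score: 0)

I use the following trick: first use Requests to make sure the server responds, and then use the driver:

import requests
from bs4 import BeautifulSoup

# link and driver are assumed to be defined earlier
resp = requests.get(link)
# keep retrying until the server answers with 200
while resp.status_code != 200:
    resp = requests.get(link)

driver.get(link)
html = driver.page_source

soup = BeautifulSoup(html, "html.parser")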
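A loop like this never ends if the server keeps returning something other than 200. A bounded variant could look like the sketch below; wait_for_ok and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_for_ok(get_status, attempts=5, delay=1.0):
    """Call get_status() until it returns 200; give up after `attempts` tries."""
    for attempt in range(attempts):
        if get_status() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next try
    return False
```

With Requests it might be called as wait_for_ok(lambda: requests.get(link).status_code), and the driver would be used only when it returns True.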