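I've run into the strangest situation: a website (http://seventhgeneration.com/mission) is incorrectly returning a 404 response code.
I'm writing an automated test suite that checks every link on a site and verifies that none of them is broken. In this case, I'm testing a site that links to http://seventhgeneration.com/mission, although I have no control over Seventh Generation's mission page. The page works in a browser, but it does return a 404 in the network monitor.
Is there any technical way to validate this page as a non-error page while still correctly detecting other pages (e.g. https://github.com/thisShouldNotExist) as 404s? As mentioned in the comments, the Seventh Generation site does have a 404 page that it shows for other broken URLs: http://seventhgeneration.com/shouldNotExist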
# -*- coding: utf-8 -*-
# Python 2 script: request the page with browser-like headers and report the status.
import traceback
import urllib2
import httplib

url = 'http://seventhgeneration.com/mission'
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    # 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
request = urllib2.Request(url, headers=HEADERS)
try:
    response = urllib2.urlopen(request)
    response_header = response.info()
    print "Success: %s - %s" % (response.code, response_header)
except urllib2.HTTPError as e:
    # Raised for 4xx/5xx responses; must be caught before URLError, its base class.
    print 'urllib2.HTTPError %s - %s' % (e.code, e.headers)
except urllib2.URLError as e:
    print "Unknown URLError: %s" % (e.reason)
except httplib.BadStatusLine as e:
    print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"
except Exception:
    print "Unknown Exception: %s" % (traceback.format_exc())
When run, this script prints:
urllib2.HTTPError 404 - Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: HIT
Etag: "1422054308-1"
Content-Language: en
Link: </node/1523879>; rel="shortlink",</404>; rel="canonical",</node/1523879>; rel="shortlink",</404>; rel="canonical"
X-Generator: Drupal 7 (http://drupal.org)
Cache-Control: public, max-age=21600
Last-Modified: Fri, 23 Jan 2015 23:05:08 +0000
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: Cookie,Accept-Encoding
Content-Encoding: gzip
X-Request-ID: v-82b55230-a357-11e4-94fe-1231380988d9
X-AH-Environment: prod
Content-Length: 11441
Accept-Ranges: bytes
Date: Fri, 23 Jan 2015 23:28:17 GMT
X-Varnish: 2729940224
Age: 0
Via: 1.1 varnish
Connection: close
X-Cache: MISS
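Answer 0 (score: 0)
This server is clearly not conforming to the HTTP specification. It returns a complete HTML web page as the body of a 404 response, where the body is supposed to be a description of why the 404 error occurred. You need to get this fixed rather than find a way around it.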
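If getting the server fixed is not an option (the question notes that the page is outside the asker's control), one possible workaround is to compare the body of the suspicious 404 response with the body returned for a URL on the same host that is known not to exist. The following is only a rough sketch under several assumptions: it uses Python 3 rather than the Python 2 of the question's script, the helper names `fetch_body`, `similarity`, and `looks_like_real_page` are hypothetical, and the 0.9 threshold is an arbitrary starting point, not a measured value. It also only helps if the page really serves different content from the site's regular 404 page.

```python
import difflib
import urllib.error
import urllib.request


def fetch_body(url, timeout=10):
    """Return (status_code, body_text) for url, even when the status is an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.getcode(), resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # HTTPError doubles as a response object, so the error body is readable.
        return err.code, err.read().decode("utf-8", errors="replace")


def similarity(body_a, body_b):
    """Line-level similarity between two bodies, from 0.0 to 1.0."""
    matcher = difflib.SequenceMatcher(None, body_a.splitlines(), body_b.splitlines())
    return matcher.ratio()


def looks_like_real_page(body, known_404_body, threshold=0.9):
    """Heuristic (assumption): a body that differs enough from the site's
    regular 404 page is treated as real content despite the 404 status."""
    return similarity(body, known_404_body) < threshold
```

With the URLs from the question, one would call `fetch_body` on http://seventhgeneration.com/mission and on http://seventhgeneration.com/shouldNotExist and pass the two bodies to `looks_like_real_page`. Whether the two bodies differ enough for this to work can only be determined by running it against the live site.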