I am writing a script to check a large number of URLs and return the HTTP status code for each one. I have tried everything I could think of, or find online, for exception handling. The script runs for a while and then eventually crashes with this error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='10.10.10.10', port=80): Max retries exceeded with url: /wmedia (Caused by NewConnectionError("<urllib3.connection.HTTPConnection object at 0x1029bfe10>: Failed to establish a new connection: [Errno 49] Can't assign requested address",))
I think the server gets overwhelmed by too many requests after a while, and adding sleep time has not helped.
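For context, `[Errno 49] Can't assign requested address` usually means the client itself has exhausted its ephemeral ports by opening a brand-new TCP connection for every request. One common mitigation (a sketch only, not the script's actual code; it assumes the `requests` and `urllib3` packages) is to share a single `Session` with a retry-enabled adapter so connections are pooled and reused:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch: one shared Session pools and reuses TCP connections instead of
# opening a fresh socket for every request.
session = requests.Session()
retry = Retry(total=3, backoff_factor=1)  # retry transient failures with backoff
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

# A worker would then call session.get(url, timeout=2) instead of requests.get(...)
```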
Here is my worker function, which I use with a process pool:
import time
import requests

def get(url):
    status = None
    try:
        # requests.get itself raises ConnectionError/Timeout, so it must be
        # inside the try block or the worker crashes before any handler runs
        r = requests.get(url, timeout=2)
        r.raise_for_status()
        status = r.status_code
    except requests.exceptions.HTTPError as err:
        print(err)
        status = err.response.status_code
    except requests.ConnectionError as e:
        print("OOPS!! Connection Error")
        status = "Connection refused"
        time.sleep(2)
        print(str(e))
    except requests.Timeout as e:
        print("OOPS!! Timeout Error")
        status = "Timed out"
        time.sleep(2)
        print(str(e))
    except requests.RequestException as e:
        print("OOPS!! General Error")
        status = "Error"
        print(str(e))
    except KeyboardInterrupt:
        print("Someone closed the program")
        status = "Interrupted"
    return url, status  # 'param' was undefined; return the url instead
Any suggestions?
Answer 0 (score: 1)
You can use urllib to get the HTTP status code. This site lists all possible HTTP status codes, separated by commas (I use it in the example below as 'httpStatusCodes.txt').
So we read all the status codes into a dict, and make provision for when a code is not available.
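The answer never shows the file itself; each line is assumed to pair a numeric code with its description. A self-contained Python 3 sketch of the lookup, with hypothetical file contents inlined as a string:

```python
from collections import defaultdict

# Hypothetical contents of httpStatusCodes.txt: one "code,description" per line.
raw = """200,OK
301,Moved Permanently
404,Not Found
503,Service Unavailable"""

codes = {}
for line in raw.splitlines():
    key, val = line.rstrip().split(',')
    codes[int(key)] = val

# Unknown codes fall back to a default description.
codes = defaultdict(lambda: "'Code not defined'", codes)
print(codes[404])   # Not Found
print(codes[999])   # 'Code not defined'
```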
import urllib
from collections import defaultdict

adict = {}
with open("httpStatusCodes.txt") as f:
    for line in f:
        line = line.rstrip()
        (key, val) = line.split(',')
        adict[int(key)] = val

# Unknown codes fall back to a default description
adict = defaultdict(lambda: "'Code not defined'", adict)

Then we loop through the list of websites and fetch their status codes.
websites = ['facebook.com', 'twitter.com', 'google.com',
            'youtube.com', 'icantfindthiswebsite.com']
for url in websites:
    try:
        code = urllib.urlopen('http://' + url).getcode()
    except IOError:
        code = None
    print "url = {}, code = {}, status = {}".format(url, code, adict[code])

Note that I deliberately listed icantfindthiswebsite.com to simulate a website that cannot be reached. That failure raises IOError, which the loop handles.
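Note that the answer's snippet is Python 2 (urllib.urlopen and the bare print statement are gone in Python 3). A rough Python 3 equivalent, assuming urllib.request and a hypothetical fetch_code helper:

```python
from urllib.request import urlopen
from urllib.error import URLError

def fetch_code(url):
    """Return the HTTP status code, or None when the site is unreachable."""
    try:
        return urlopen('http://' + url, timeout=5).getcode()
    except (URLError, OSError):
        return None
```

Unreachable hosts such as icantfindthiswebsite.com make urlopen raise URLError, so fetch_code returns None for them.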
Result