Skip an element in a list when it returns an error

Date: 2017-05-09 20:46:15

Tags: python-2.7 web-scraping web-crawler

To improve my skills I wrote this script: it takes a list of websites, collects their links, then visits each site and crawls it looking for its "contact-us" page. The problem is that the script stops as soon as one of the sites is unreachable; what I want is to skip that site and continue with the rest. Here is my code:

import requests
from bs4 import BeautifulSoup
from urlparse import urlparse
from mechanize import Browser
import re

headers = [('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0')]

urls = 'http://www.officialusa.com/stateguides/chambers/georgia.html'
links_dict = []

response = requests.get(urls, headers)
bsObj = BeautifulSoup(response.text,'lxml')
for tag in bsObj.find_all('li'):
    links_dict.append(tag.a.get('href'))


for ink in links_dict:
    r = requests.get(ink)
    # get domain name only
    parsed_uri = urlparse(ink)
    domain = parsed_uri.netloc
    br = Browser()
    br.set_handle_robots(False)
    br.addheaders = headers
    try:
        br.open(str(ink))
        for link in br.links():
            siteMatch = re.compile(ink).search(link.url)
            print link.url
    except:
        pass

Everything works fine for the other links. Here is the error:

Traceback (most recent call last):
  File "/home/qunix/PycharmProjects/challange/crawel.py", line 20, in <module>
    r = requests.get(ink)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.quitmangeorgia.org', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7facf68cca50>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
Thank you!!

1 answer:

Answer 0: (score: 0)

Try wrapping the line

r = requests.get(ink)

in a try/except, like so:

try:
    r = requests.get(ink)
except requests.exceptions.ConnectionError:  # bare ConnectionError is not a builtin in Python 2.7
    continue

This means that if the call to requests.get raises a ConnectionError, as it does in your example, the loop simply moves on to the next website in the list.
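The skip-on-error pattern can be sketched without touching the network: the `crawl` and `fake_fetch` helpers below are hypothetical stand-ins (not part of your script or of requests) that simulate a fetch failing for one host, so you can see that the loop continues past the bad site rather than crashing.

```python
def crawl(urls, fetch):
    """Visit each URL with fetch(), skipping any that raise ConnectionError."""
    visited = []
    for url in urls:
        try:
            fetch(url)           # may raise, like requests.get(ink)
        except ConnectionError:  # in the real script: requests.exceptions.ConnectionError
            continue             # skip this site, move on to the next one
        visited.append(url)
    return visited

def fake_fetch(url):
    # hypothetical stand-in for requests.get: one host fails to resolve
    if "quitmangeorgia" in url:
        raise ConnectionError("Temporary failure in name resolution")
    return "<html></html>"

sites = ["http://a.example", "http://www.quitmangeorgia.org", "http://b.example"]
print(crawl(sites, fake_fetch))  # the failing host is skipped
```

The key point is that the `try` must enclose the call that actually raises; in your script the crash happens at `r = requests.get(ink)`, which currently sits outside the try block.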