Question

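How can I bypass missing links and keep scraping the good data?

I am using Python 2 on Ubuntu 14.04.3.

I am scraping a web page that contains multiple links to associated data. Some of the associated links are missing, so I need a way to skip the missing links and continue scraping.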

Web page 1
    part description 1 with associated link
    part description 2 w/o associated link
    more part descriptions with and w/o associated links
Web page n+
    more part descriptions

I have tried:

try:
    Do some things.
    Error caused by missing link.

except IndexError as e:
    print "I/O error({0}): {1}".format(e.errno, e.strerror)
    break # to go on to next link. 
    # Did not work because program stopped to report error!

Because the link is simply absent from the web page, an "if link is missing" check will not work.

Thanks again for your help!!!

Answer 1

Maybe you are looking for something like this:

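import urllib

def get_content_safe(url):
    # Return the opened page, or None if the URL cannot be fetched.
    try:
        contents = urllib.urlopen(url)  # Python 2 API; urllib.open does not exist
        return contents
    except IOError, ex:
        # Report ex your way
        return None

def scrap(url):
    # ....
    content = get_content_safe(url)
    if content is None:
        return  # skip this link; use continue instead when inside a loop
    # ....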

Long story short, as Basilevs said, when you catch the exception your code does not break and execution continues.
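As a minimal sketch of that idea applied to the question's case, assume each part description comes with a list of links that may be empty. The `parts` data and the `collect_links` helper below are hypothetical and only stand in for whatever the real scraper extracts; no network access is involved.

```python
# Hypothetical data: each part has a description and a (possibly empty) list of links.
parts = [
    ("part 1", ["http://example.com/part1"]),
    ("part 2", []),  # no associated link
    ("part 3", ["http://example.com/part3"]),
]

def collect_links(parts):
    """Return (description, link) pairs, skipping parts that have no link."""
    found = []
    for description, links in parts:
        try:
            link = links[0]  # raises IndexError when the list is empty
        except IndexError:
            continue  # skip this part and go on to the next one
        found.append((description, link))
    return found

results = collect_links(parts)
```

Here `results` holds only the parts that actually had a link; the part without one is skipped and the loop keeps going.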

Answer 2

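I corrected my mistake by following the Python 2 documentation on try/except. The corrected except clause gets past the site's missing links and continues scraping data.

The except correction: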

    except:
        # A bare except avoids the earlier
        # AttributeError: 'exceptions.IndexError' object has no attribute 'errno'
        e = sys.exc_info()[0]  # the exception class; needs "import sys" at the top
        print "Error: %s" % e
        break
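One point worth checking in the snippet above is the choice between break and continue: break leaves the enclosing loop entirely, while continue moves on to the next item. The sketch below is only an illustration under assumed names (`scrape_rows`, `rows`, and `stop_on_error` are hypothetical); it also catches IndexError specifically and reports it with str(e), since IndexError has no errno or strerror attributes.

```python
def scrape_rows(rows, stop_on_error):
    """Return the first cell of each row plus any error messages collected."""
    values = []
    errors = []
    for row in rows:
        try:
            values.append(row[0])  # raises IndexError for an empty row
        except IndexError as e:
            errors.append(str(e))  # str(e) works; e.errno would raise AttributeError
            if stop_on_error:
                break  # leaves the loop: later rows are never read
            continue  # goes on to the next row
    return values, errors

rows = [["a"], [], ["c"]]
with_break = scrape_rows(rows, stop_on_error=True)
with_continue = scrape_rows(rows, stop_on_error=False)
```

With break, the row after the empty one is never reached; with continue, it is still collected.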

I will look over the answers posted to my question.

Thanks again for your help!
