Script STOPS when the source URL does not return status 200

Time: 2013-06-02 20:40:01

Tags: python xml parsing xpath web-scraping

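I have a script (see below) that checks, more or less, where on a website a link is located. It works fine, but it exits as soon as the source URL that is supposed to contain the link returns anything other than a 200 response. I just want it to skip to the next URL or print some "error" message, or, even better, give me the HTTP status code that came back. I need a quick solution; it would be great if someone could help me :)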

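URLs.csv = a list of websites that contain a link to some page
domain.com = the domain whose link should be looked for and, if it is present, roughly where on the page it sits.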

import csv
from lxml import html

with open('URLs.csv', 'r') as csvfile:
    urls = [row[0] for row in csv.reader(csvfile)]

for url in urls:
    print url

    doc = html.parse(url)
    if doc.xpath('//a[contains(@href,"domain.com")]'):
        for anchor_node in doc.xpath('//a[contains(@href,"domain.com")]'):
            if anchor_node.xpath('./ancestor::div[contains(@class, "sidebar")]'):
                print 'Sidebar'
            elif anchor_node.xpath('./parent::div[contains(@class, "widget")]'):
                print 'Sidebar'
            elif anchor_node.xpath('./ancestor::div[contains(@id, "sidebar")]'):
                print 'Sidebar'
            elif anchor_node.xpath('./ancestor::div[contains(@class, "comment")]'):
                print 'Kommentar'
            elif anchor_node.xpath('./ancestor::div[contains(@id, "comment")]'):
                print 'Kommentar'
            elif anchor_node.xpath('./ancestor::div[contains(@class, "foot")]'):
                print "Footer"
            elif anchor_node.xpath('./ancestor::div[contains(@id, "foot")]'):
                print "Footer"
            elif anchor_node.xpath('./ancestor::div[contains(@class, "post")]'):
                print "Contextual"
            else:
                print 'Unidentified Link'
    else:
        print 'Link is Dead'

Python Shell

Python 2.7.4 (default, Apr  6 2013, 19:55:15) [MSC v.1500 64 bit (AMD64)]
Type "help", "copyright", "credits" or "license" for more information.
[evaluate Linkidentifizierung.py]
http://urlnotworking.com/broken.html
Traceback (most recent call last):
  File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\_sandbox.py", line 11, in <module>
  File "C:\Python27\Lib\site-packages\lxml\html\__init__.py", line 735, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 3197, in lxml.etree.parse (src\lxml\lxml.etree.c:64726)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 1571, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:92363)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 1600, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:92647)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 1500, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:91710)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 1047, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:88610)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 577, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:84019)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 676, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:85122)
  File "C:\Python27\Lib\site-packages\lxml\etree.pyd", line 614, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:84417)
IOError: Error reading file 'http://urlnotworking.com/broken.html': failed to load HTTP resource

1 Answer:

Answer 0 (score: 0)

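print url

try:
    # Keep the parsed document so the xpath checks below can use it.
    doc = html.parse(url)
except Exception as e:
    # lxml raises IOError when the resource cannot be loaded.
    print "something went wrong: %s" % e
    continue

if doc.xpath....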

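I'm not very familiar with the lxml.html library; see whether you can get a more detailed reason from it for why the URL could not be loaded. (Hint: I would use the "requests" library to load my URLs and then pass the result to the lxml parser.)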
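
A minimal sketch of that hint, assuming the third-party `requests` package is installed alongside lxml. The names `fetch_document` and `check_urls`, the `get` parameter (which exists only so the fetcher can be swapped out) and the 10-second timeout are illustrative choices, not part of the original script. The idea is to download the page with `requests`, read `status_code` from the response, and hand the body to lxml only when the status is 200. The question's script targets Python 2.7; the sketch uses `print` as a function so that it is valid on both 2.7 and 3.

```python
from __future__ import print_function  # print() works the same on 2.7 and 3

import requests
from lxml import etree, html


def fetch_document(url, get=requests.get, timeout=10):
    """Return (status_code, doc); doc is None unless the page is usable."""
    response = get(url, timeout=timeout)
    if response.status_code != 200:
        # Report the status instead of letting lxml fail on the URL.
        return response.status_code, None
    try:
        # Parse the bytes that were already downloaded.
        return response.status_code, html.fromstring(response.content)
    except etree.ParserError:
        # lxml refuses an empty body; treat it like an unusable page.
        return response.status_code, None


def check_urls(urls, domain="domain.com", get=requests.get):
    """Print one line per URL and return (url, status, link_count) tuples."""
    results = []
    for url in urls:
        try:
            status, doc = fetch_document(url, get=get)
        except requests.RequestException as exc:
            # DNS failure, refused connection, timeout: no status code exists.
            print(url, "request failed:", exc)
            results.append((url, None, 0))
            continue
        if doc is None:
            print(url, "skipped, HTTP status", status)
            results.append((url, status, 0))
            continue
        # $domain is an XPath variable filled in from the keyword argument.
        anchors = doc.xpath('//a[contains(@href, $domain)]', domain=domain)
        print(url, "OK,", len(anchors), "link(s) to", domain)
        results.append((url, status, len(anchors)))
    return results
```

Calling `check_urls(urls)` with the list read from URLs.csv would then replace the bare `html.parse(url)` call, and the sidebar/comment/footer checks from the question could run on each anchor inside the loop. Note that `requests` follows redirects by default, so the status reported is that of the final response.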