I have a simple Python crawler/spider that searches for a specified text on websites I give it. On some sites, though, it crawls fine for 2-4 seconds until an error occurs.
The code so far:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import requests, pyquery, urlparse
try:
    range = xrange
except NameError:
    pass

def crawl(seed, depth, terms):
    crawled = set()
    uris = set([seed])
    for level in range(depth):
        new_uris = set()
        for uri in uris:
            if uri in crawled:
                continue
            crawled.add(uri)
            # Get URI contents
            try:
                content = requests.get(uri).content
            except:
                continue
            # Look for the terms
            found = 0
            for term in terms:
                if term in content:
                    found += 1
            if found > 0:
                yield (uri, found, level + 1)
            # Find child URIs, and add them to the new_uris set
            dom = pyquery.PyQuery(content)
            for anchor in dom('a'):
                try:
                    link = anchor.attrib['href']
                except KeyError:
                    continue
                new_uri = urlparse.urljoin(uri, link)
                new_uris.add(new_uri)
        uris = new_uris
if __name__ == '__main__':
    import sys
    if len(sys.argv) < 4:
        print('usage: ' + sys.argv[0] +
              " start_url crawl_depth term1 [term2 [...]]")
        print('       ' + sys.argv[0] +
              " http://yahoo.com 5 cute 'fluffy kitties'")
        raise SystemExit
    seed_uri = sys.argv[1]
    crawl_depth = int(sys.argv[2])
    search_terms = sys.argv[3:]
    for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
        print(uri)
Now let's say I want to find all pages that have 'requireLazy(' in their source code. Let's try it; if I run this:
python crawler.py https://www.facebook.com 4 '<script>requireLazy('
It runs fine for 2-4 seconds and then this error occurs:
https://www.facebook.com
https://www.facebook.com/badges/?ref=pf
https://www.facebook.com/appcenter/category/music/?ref=pf
https://www.facebook.com/legal/terms
https://www.facebook.com/
...
Traceback (most recent call last):
File "crawler.py", line 61, in <module>
for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
File "crawler.py", line 38, in crawl
dom = pyquery.PyQuery(content)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 226, in __init__
elements = fromstring(context, self.parser)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 70, in fromstring
result = getattr(lxml.html, meth)(context)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827) lxml.etree.XMLSyntaxError: line 21: Tag fb:like invalid
Can anyone help me fix this error? Thanks.
Answer 0 (score: 1)
It seems that the content of the page you are trying to parse contains some invalid markup. Usually the best you can do is catch and log such errors and move on gracefully to the next page.
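For instance, here is a minimal sketch of that idea while keeping pyquery: the helper name parse_dom is my own, and lxml.etree.XMLSyntaxError is simply the exception class from your traceback (a broader except Exception would also work if other parse errors turn up).

import lxml.etree
import pyquery

def parse_dom(uri, content):
    """Return a PyQuery document for content, or None if parsing fails."""
    try:
        return pyquery.PyQuery(content)
    except lxml.etree.XMLSyntaxError as exc:
        # Log the problem page and let the crawl move on to the next URI
        print('skipping %s: %s' % (uri, exc))
        return None

Inside crawl() you would then do dom = parse_dom(uri, content) and continue when it returns None.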
Hopefully you can use BeautifulSoup to extract the URLs to crawl next, since it handles most bad markup gracefully. You can find more details about BeautifulSoup and how to use it here.
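As a rough sketch of what that could look like (the extract_links helper and the 'html.parser' backend are my own choices for illustration, not part of the original code):

from bs4 import BeautifulSoup   # pip install beautifulsoup4
import urlparse                 # Python 2; on Python 3 use urllib.parse

def extract_links(uri, content):
    """Yield absolute child URIs found in content, tolerating bad markup."""
    soup = BeautifulSoup(content, 'html.parser')
    for anchor in soup.find_all('a', href=True):
        yield urlparse.urljoin(uri, anchor['href'])

In crawl() this would replace the pyquery/lxml block, e.g. new_uris.update(extract_links(uri, content)).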
Actually, after playing with the crawler, it seems that at some point the page content comes back empty, so the parser cannot load the document.
I tested the crawler using BeautifulSoup and it works correctly. I can share my updated version with you if you need/want it.
You could easily add a check for empty content, but I'm not sure what other edge cases you might run into, so switching to BeautifulSoup seems like the safer approach.
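If you would rather stay with pyquery, a minimal sketch of that empty-content check (the guard is illustrative; it just rejects responses with no usable body before they reach the parser):

def has_usable_content(content):
    """True if the response body is non-empty once whitespace is stripped."""
    return bool(content and content.strip())

Calling it right before dom = pyquery.PyQuery(content) and doing continue when it returns False avoids handing an empty document to lxml.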