I have a simple Python crawler/spider that searches for a specified text on websites I give it. On some sites, though, it crawls fine for 2-4 seconds until an error occurs.
The code so far:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import requests, pyquery, urlparse
try:
    range = xrange
except NameError:
    pass

def crawl(seed, depth, terms):
    crawled = set()
    uris = set([seed])
    for level in range(depth):
        new_uris = set()
        for uri in uris:
            if uri in crawled:
                continue
            crawled.add(uri)
            # Get URI contents
            try:
                content = requests.get(uri).content
            except:
                continue
            # Look for the terms
            found = 0
            for term in terms:
                if term in content:
                    found += 1
            if found > 0:
                yield (uri, found, level + 1)
            # Find child URIs, and add them to the new_uris set
            dom = pyquery.PyQuery(content)
            for anchor in dom('a'):
                try:
                    link = anchor.attrib['href']
                except KeyError:
                    continue
                new_uri = urlparse.urljoin(uri, link)
                new_uris.add(new_uri)
        uris = new_uris
if __name__ == '__main__':
    import sys
    if len(sys.argv) < 4:
        print('usage: ' + sys.argv[0] +
              " start_url crawl_depth term1 [term2 [...]]")
        print('       ' + sys.argv[0] +
              " http://yahoo.com 5 cute 'fluffy kitties'")
        raise SystemExit
    seed_uri = sys.argv[1]
    crawl_depth = int(sys.argv[2])
    search_terms = sys.argv[3:]
    for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
        print(uri)
Now let's say I want to find all pages that have 'requireLazy(' in their source code. Let's try it; if I run this:
python crawler.py https://www.facebook.com 4 '<script>requireLazy('
It runs fine for 2-4 seconds and then this error occurs:
https://www.facebook.com
https://www.facebook.com/badges/?ref=pf
https://www.facebook.com/appcenter/category/music/?ref=pf
https://www.facebook.com/legal/terms
https://www.facebook.com/
...
Traceback (most recent call last):
File "crawler.py", line 61, in <module>
for uri, count, depth in crawl(seed_uri, crawl_depth, search_terms):
File "crawler.py", line 38, in crawl
dom = pyquery.PyQuery(content)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 226, in __init__
elements = fromstring(context, self.parser)
File "/usr/local/lib/python2.7/dist-packages/pyquery/pyquery.py", line 70, in fromstring
result = getattr(lxml.html, meth)(context)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 634, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827) lxml.etree.XMLSyntaxError: line 21: Tag fb:like invalid
Can anyone help me fix this error? Thanks.
Answer 0 (score: 1)
It seems that the content of the page you are trying to parse contains some invalid markup. Usually the best you can do is catch and log such errors and move on gracefully to the next page.
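For instance, here is a minimal sketch of that idea while keeping pyquery: the helper name parse_dom is my own, and lxml.etree.XMLSyntaxError is simply the exception class from your traceback (a broader except Exception would also work if other parse errors turn up).

import lxml.etree
import pyquery

def parse_dom(uri, content):
    """Return a PyQuery document for content, or None if parsing fails."""
    try:
        return pyquery.PyQuery(content)
    except lxml.etree.XMLSyntaxError as exc:
        # Log the problem page and let the crawl move on to the next URI
        print('skipping %s: %s' % (uri, exc))
        return None

Inside crawl() you would then do dom = parse_dom(uri, content) and continue when it returns None.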
Hopefully you can use BeautifulSoup to extract the URLs to crawl next, since it handles most bad markup gracefully. You can find more details about BeautifulSoup and how to use it here.
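As a rough sketch of what that could look like (the extract_links helper and the 'html.parser' backend are my own choices for illustration, not part of the original code):

from bs4 import BeautifulSoup   # pip install beautifulsoup4
import urlparse                 # Python 2; on Python 3 use urllib.parse

def extract_links(uri, content):
    """Yield absolute child URIs found in content, tolerating bad markup."""
    soup = BeautifulSoup(content, 'html.parser')
    for anchor in soup.find_all('a', href=True):
        yield urlparse.urljoin(uri, anchor['href'])

In crawl() this would replace the pyquery/lxml block, e.g. new_uris.update(extract_links(uri, content)).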
Actually, after playing with the crawler, it seems that at some point the page content comes back empty, so the parser cannot load the document.
I tested the crawler using BeautifulSoup and it works correctly. I can share my updated version with you if you need/want it.
You could easily add a check for empty content, but I'm not sure what other edge cases you might run into, so switching to BeautifulSoup seems like the safer approach.
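If you would rather stay with pyquery, a minimal sketch of that empty-content check (the guard is illustrative; it just rejects responses with no usable body before they reach the parser):

def has_usable_content(content):
    """True if the response body is non-empty once whitespace is stripped."""
    return bool(content and content.strip())

Calling it right before dom = pyquery.PyQuery(content) and doing continue when it returns False avoids handing an empty document to lxml.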