To the best of my current knowledge, I have written a small web spider/crawler that can crawl recursively with a variable nesting depth and can also optionally perform a POST/GET pre-login before crawling (if required).

Since I am a complete beginner, I would like to get some feedback, improvements, or any other input from you.

I am only adding the `parser` function here. The whole source code can be viewed on GitHub: https://github.com/cytopia/crawlpy

What I really want to make sure is that the recursion in combination with `yield` is as efficient as possible, and also that I am doing it the right way.

Any comments on this, as well as on coding style, are very welcome.
def parse(self, response):
    """
    Scrapy parse callback
    """
    # Get current nesting level
    if response.meta.has_key('depth'):
        curr_depth = response.meta['depth']
    else:
        curr_depth = 1

    # Only crawl the current page if we hit a HTTP-200
    if response.status == 200:
        hxs = Selector(response)
        links = hxs.xpath("//a/@href").extract()

        # We store already crawled links in this list
        crawled_links = []

        # Pattern to check proper link
        linkPattern = re.compile("^(?:http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        for link in links:
            # Link could be a relative url from response.url
            # such as link: '../test', response.url: http://dom.tld/foo/bar
            if link.find('../') == 0:
                link = response.url + '/' + link
            # Prepend BASE URL if it does not have it
            elif 'http://' not in link and 'https://' not in link:
                link = self.base_url + link

            # If it is a proper link and is not checked yet, yield it to the Spider
            if (link
                    and linkPattern.match(link)
                    and link.find(self.base_url) == 0):
                    #and link not in crawled_links
                    #and link not in uniques):

                # Check if this url already exists
                re_exists = re.compile('^' + link + '$')
                exists = False
                for i in self.uniques:
                    if re_exists.match(i):
                        exists = True
                        break

                if not exists:
                    # Store the link
                    crawled_links.append(link)
                    self.uniques.append(link)

                    # Do we recurse?
                    if curr_depth < self.depth:
                        request = Request(link, self.parse)
                        # Add meta-data about the current recursion depth
                        request.meta['depth'] = curr_depth + 1
                        yield request
                    else:
                        # Nesting level too deep
                        pass
            else:
                # Link not in condition
                pass

        #
        # Final return (yield) to user
        #
        for url in crawled_links:
            item = CrawlpyItem()
            item['url'] = url
            item['depth'] = curr_depth
            yield item
    else:
        # NOT HTTP 200
        pass
Answer 0 (score: 2)
Your entire code can be simplified to:
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Get current nesting level
    curr_depth = response.meta.get('depth', 1)

    item = CrawlpyItem()  # could also just be `item = dict()`
    item['url'] = response.url
    item['depth'] = curr_depth
    yield item

    links = LinkExtractor().extract_links(response)
    for link in links:
        yield Request(link.url, meta={'depth': curr_depth + 1})
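As a side note (this assumes the stock middleware stack is enabled), Scrapy can also enforce the nesting depth for you via its `DEPTH_LIMIT` setting, so even the manual `depth` bookkeeping in `meta` could be dropped if you only need it as a cut-off:

```python
# settings.py (sketch): with the default DepthMiddleware enabled,
# Scrapy stops following links more than 3 hops from the start URLs.
DEPTH_LIMIT = 3
```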
If I understand correctly what you are trying to do here: broadly crawl all URLs and yield the depth and URL as items?
Scrapy has its dupe filter enabled by default, so you don't need to implement that logic yourself. Also, your `parse()` method will never receive anything other than a 200 response, so that check is useless.
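If you ever do need your own bookkeeping anyway, a plain `set` does the uniqueness check in O(1) and avoids the pitfalls of the `re.compile('^' + link + '$')` approach in the original, where URL metacharacters such as `?` or `+` silently change the pattern's meaning. A minimal stdlib sketch:

```python
# Deduplicate URLs with a set instead of compiling a regex per link.
seen = set()
unique_links = []
for link in ["http://dom.tld/?q=1", "http://dom.tld/?q=1", "http://dom.tld/a+b"]:
    if link not in seen:  # O(1) membership test, no regex escaping issues
        seen.add(link)
        unique_links.append(link)
print(unique_links)  # the duplicate is dropped, order is preserved
```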
Edit: reworked to avoid dupes.
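One more stdlib suggestion, not Scrapy-specific: the hand-rolled `'../'` handling in the original `parse()` can be replaced by `urljoin`, which resolves relative links against the response URL correctly in all cases (Python 3 import shown; in Python 2 the same function lives in the `urlparse` module):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://dom.tld/foo/bar"
# '../' is resolved relative to the directory of the response URL
print(urljoin(base, "../test"))             # http://dom.tld/test
# absolute paths replace the whole path component
print(urljoin(base, "/baz"))                # http://dom.tld/baz
# fully-qualified links pass through unchanged
print(urljoin(base, "http://other.tld/x"))  # http://other.tld/x
```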