I'm new to Scrapy and I'm trying to build a spider that has to do this kind of job:
recursively extract all links from a generic web page, down to a specific depth.
I tried to do it with the following code:
from scrapy.conf import settings
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import NewsItem  # adjust to wherever NewsItem is defined

# Override the depth limit at module level, before the crawl starts
settings.overrides['DEPTH_LIMIT'] = 1

class MySpider(CrawlSpider):
    name = "cnet"
    allowed_domains = ["cnet.com"]
    start_urls = ["http://www.cnet.com/"]

    # Follow every cnet.com link and pass each fetched page to parse_items
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('cnet.com',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        print ""
        print "PARSE ITEMS"
        print ""
        hxs = HtmlXPathSelector(response)
        anchors = hxs.select('//a')
        items = []
        for anchor in anchors:  # was "for titles in titles", which shadowed the list
            item = NewsItem()
            item["title"] = anchor.select("text()").extract()
            item["link"] = anchor.select("@href").extract()
            if len(item["link"]) > 0 and self.allowed_domains[0] in item["link"][0]:
                print ""
                print response.meta['depth']
                print item["title"]
                print item["link"]
                print ""
                items.append(item)
        return items
But it seems to keep running in an INFINITE loop. Any suggestions?
Thanks a lot!
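
For comparison, here is a minimal sketch of how the same depth-capped crawl is usually written in newer Scrapy releases (this assumes Scrapy >= 1.8, where LinkExtractor, custom_settings and Selector.get() are all available; the spider name, domain and depth of 1 are carried over from the code above):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class DepthLimitedSpider(CrawlSpider):
        name = "cnet"
        allowed_domains = ["cnet.com"]
        start_urls = ["http://www.cnet.com/"]

        # custom_settings is read by the crawler before the first request,
        # so the cap is guaranteed to be in place. DEPTH_LIMIT = 1 means
        # the start pages plus the pages they link to, nothing deeper.
        custom_settings = {"DEPTH_LIMIT": 1}

        rules = (
            Rule(LinkExtractor(allow_domains=("cnet.com",)),
                 callback="parse_items", follow=True),
        )

        def parse_items(self, response):
            # 'depth' is put on response.meta by Scrapy's DepthMiddleware
            for anchor in response.xpath("//a"):
                yield {
                    "depth": response.meta.get("depth"),
                    "title": anchor.xpath("text()").get(),
                    "link": anchor.xpath("@href").get(),
                }

Note that even with DEPTH_LIMIT = 1 the run is not instant: every page reachable in one hop from the start URL is still fetched, which on a large site can look like an endless crawl in the console.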