I'm running into trouble with Scrapy. I need code that will scrape up to 1000 internal links per given URL. My code works when run at the command line, but the spider doesn't stop; it just keeps receiving messages.
My code is as follows:
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field
from scrapy.contrib.closespider import CloseSpider

class MyItem(Item):
    url = Field()

class MySpider(CrawlSpider):
    name = 'testspider1'
    allowed_domains = ['angieslist.com']
    start_urls = ['http://www.angieslist.com']

    # follow every internal link and pass each response to parse_url
    rules = (Rule(SgmlLinkExtractor(), callback='parse_url', follow=True), )

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url

        # number of items scraped so far, taken from the crawler stats
        scrape_count = self.crawler.stats.get_value('item_scraped_count')
        print scrape_count

        limit = 10
        if scrape_count == limit:
            raise CloseSpider('Limit Reached')

        return item
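A side note on the snippet above: the raisable CloseSpider exception lives in scrapy.exceptions, not in scrapy.contrib.closespider (that module holds the extension of the same name), and an exact == check can be skipped past when several responses are processed concurrently. A minimal corrected sketch of the exception-based callback, reusing the class and names from the question:

from scrapy.exceptions import CloseSpider  # the raisable exception lives here

class MySpider(CrawlSpider):
    # ... name, allowed_domains, start_urls, rules as in the question ...

    def parse_url(self, response):
        item = MyItem()
        item['url'] = response.url

        # item_scraped_count is unset (None) before the first item, so default to 0
        scrape_count = self.crawler.stats.get_value('item_scraped_count', 0)

        # >= instead of ==, so the check still fires if the count jumps past the limit
        if scrape_count >= 1000:
            raise CloseSpider('Limit Reached')

        return item

Even with this, Scrapy finishes requests already in flight, so the crawl can overshoot the limit slightly; that matches the behavior described in the answer below.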
Answer 0 (score: 5):
My problem was that I was trying to apply CloseSpider in the wrong place. It is a setting that needs to go in the settings.py file. When I set it there manually, or passed it as an argument on the command line (see the example after the settings below), it worked, stopping within 10-20 of the value of N.
settings.py:
BOT_NAME = 'internal_links'
SPIDER_MODULES = ['internal_links.spiders']
NEWSPIDER_MODULE = 'internal_links.spiders'
CLOSESPIDER_PAGECOUNT = 1000
ITEM_PIPELINES = ['internal_links.pipelines.CsvWriterPipeline']
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'yo mama'
LOG_LEVEL = 'DEBUG'
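For the command-line route mentioned in the answer, the same setting can be passed to scrapy crawl with the -s flag, using the spider name from the question:

scrapy crawl testspider1 -s CLOSESPIDER_PAGECOUNT=1000

CLOSESPIDER_PAGECOUNT counts downloaded responses rather than scraped items (CLOSESPIDER_ITEMCOUNT is the item-based counterpart), and requests already in flight are still completed after the threshold is hit, which is consistent with the crawl stopping within 10-20 of N rather than exactly at it.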