Scrapy: passing start_url along to subsequent requests

Asked: 2012-08-02 21:44:14

Tags: python web-crawler scrapy

For three days now I have been trying to save the corresponding start_url in a request's meta attribute and pass it along to subsequent requests in Scrapy as part of an item, so that I can use the start_url as a key into a dictionary and enrich my output with additional data. It really ought to be straightforward, since it is explained in the documentation ...
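
As far as I understand it, the mechanism itself is just a round-trip through the request's meta dict. A minimal sketch of what I am trying to do (spider and callback names are made up):

from urlparse import urljoin

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class MetaSketchSpider(BaseSpider):

    name = 'meta_sketch'
    start_urls = ['http://www.example.com']

    def make_requests_from_url(self, url):
        # tag each initial request with its own start_url
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        # the tag comes back on the response ...
        start_url = response.meta['start_url']
        hxs = HtmlXPathSelector(response)
        for href in hxs.select('//a/@href').extract():
            # ... and is copied onto every follow-up request
            yield Request(urljoin(response.url, href),
                          callback=self.parse_item,
                          meta={'start_url': start_url})

    def parse_item(self, response):
        print response.meta['start_url'], response.url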

There is a discussion in the Google Scrapy group, and also a question here, but I cannot get it to run :(

I am new to Scrapy and I think it is a great framework, but for my project I have to know the start_urls of all requests, and that turns out to be complicated.

I would really appreciate some help!

At the moment my code looks like this:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class example(CrawlSpider):

    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/', )), callback='parse_item'),
    )

    def parse(self, response):
        # wrap CrawlSpider's parse and copy start_url onto every
        # request it generates
        for request_or_item in super(example, self).parse(response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(
                    meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        # tag the initial requests with their start_url
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = testItem()  # testItem is my item class
        print response.request.meta, response.url

1 Answer:

Answer 0 (score: 2)

I wanted to delete this answer, as it does not solve the OP's problem, but I am leaving it up as a Scrapy example.


Warning

    When writing crawl spider rules, avoid using parse as callback, since
    the CrawlSpider uses the parse method itself to implement its logic.
    So if you override the parse method, the crawl spider will no longer
    work.
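
To illustrate the warning: a rule's callback can have any name except parse. A minimal sketch (hypothetical spider; note that this alone does not carry start_url along in meta):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SafeSpider(CrawlSpider):

    name = 'safe_example'
    start_urls = ['http://www.example.com']

    rules = (
        # callback deliberately named something other than 'parse'
        Rule(SgmlLinkExtractor(allow=('/blablabla/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        # CrawlSpider's built-in parse() is left untouched,
        # so the rules keep being applied
        pass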

Use BaseSpider instead:

import urlparse
from datetime import datetime

from scrapy.spider import BaseSpider

import items     # our project's item definitions
import settings  # project module exposing a shared db connection

class Spider(BaseSpider):

    name = "domain_spider"

    def start_requests(self):

        last_domain_id = 0
        chunk_size = 10
        cursor = settings.db.cursor()

        while True:
            # fetch the next chunk of domains that have not been
            # scraped yet, keyed on the last domain_id we saw
            cursor.execute("""
                    SELECT domain_id, domain_url
                    FROM domains
                    WHERE domain_id > %s AND scraping_started IS NULL
                    LIMIT %s
                """, (last_domain_id, chunk_size))
            self.log('Requesting %s domains after %s' % (chunk_size, last_domain_id))
            rows = cursor.fetchall()
            if not rows:
                self.log('No more domains to scrape.')
                break

            for domain_id, domain_url in rows:
                last_domain_id = domain_id
                request = self.make_requests_from_url(domain_url)
                # attach everything we know about the domain to the
                # request, so the callback can read it from meta
                item = items.Item()
                item['start_url'] = domain_url
                item['domain_id'] = domain_id
                item['domain'] = urlparse.urlparse(domain_url).hostname
                request.meta['item'] = item

                # mark the domain as started so it is not picked up again
                cursor.execute("""
                        UPDATE domains
                        SET scraping_started = %s
                        WHERE domain_id = %s
                    """, (datetime.now(), domain_id))

                yield request

    ...
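
The elided part would contain the callback; a sketch of how it might pick the item back up (with a BaseSpider, using parse as the callback is fine):

    def parse(self, response):
        # retrieve the item that start_requests attached
        item = response.meta['item']
        # ... populate it with data scraped from the page ...
        yield item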