Scrapy request callback not being triggered

Asked: 2015-06-30 07:07:46

Tags: scrapy rules

I'm playing around with scraping data from Amazon, but it's giving me a hard time. So far my spider looks like this:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from amz.items import AmzItem  # item definition from my project (module name may differ)


class AmzCrawlerSpider(CrawlSpider):
    name = 'amz_crawler'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&bbn=1267449011&page=1&rh=n%3A284507%2Cn%3A1055398%2Cn%3A%211063498%2Cn%3A1267449011%2Cn%3A3204211011']

    # follow pagination links and parse each result page with parse_item
    rules = (Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        category_name = Selector(response).xpath('//*[@id="nav-subnav"]/a[1]/text()').extract()[0]
        products = Selector(response).xpath('//div[@class="s-item-container"]')

        for product in products:
            item = AmzItem()
            item['title'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@title').extract()[0]
            item['url'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
            request = scrapy.Request(item['url'], callback=self.parse_product)
            request.meta['item'] = item
            print "Crawl " + item['title']
            print "Crawl " + item['url']
            yield request

    def parse_product(self, response):
        print "Parse Product"
        item = response.meta['item']
        sel = Selector(response)
        item['asin'] = sel.xpath('//td[@class="bucket"]/div/ul/li/b[contains(text(),"ASIN:")]/../text()').extract()[0]
        return item

There are two problems I can't figure out. First, "Parse Product" is never printed, so I assume the parse_product method is never executed, even though the "Crawl ..." prints do show up. Maybe it has something to do with the rules?
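
To find out whether those product requests are even reaching the downloader or are being dropped silently (offsite filter, duplicate filter, download errors), I figured I could attach an errback and disable the dupe filter for them. This is just a sketch of what I would try, not what is currently in my spider; on_product_error is a helper name I made up:

    def parse_item(self, response):
        for product in Selector(response).xpath('//div[@class="s-item-container"]'):
            item = AmzItem()
            item['title'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@title').extract()[0]
            item['url'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
            request = scrapy.Request(
                item['url'],
                callback=self.parse_product,
                errback=self.on_product_error,  # surfaces download errors instead of losing them
                dont_filter=True,               # rules out the duplicate filter as the culprit
            )
            request.meta['item'] = item
            yield request

    def on_product_error(self, failure):
        self.log("Product request failed: %r" % failure)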

Second, related to the rules: it only works for the first page of a category. The crawler never follows the links to the second page of a category. I assume the pages Scrapy gets back are generated differently than what the browser sees? In the console I see a lot of 301 redirects (log excerpt below, followed by the scrapy shell check I sketched for the link extractor):

  2015-06-30 14:57:24+0800 [amz_crawler] DEBUG: Redirecting (301) to <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> from <http://www.amazon.com/gp/search/ref=sr_pg_1?sf=qz&fst=as%3Aoff&rh=n%3A2619533011%2Ck%3Apet+supplies%2Cp_n_date_first_available_absolute%3A2661609011%2Cp_72%3A2661618011&sort=date-desc-rank&keywords=pet+supplies&ie=UTF8&qid=1435312739>
  2015-06-30 14:57:29+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> (referer: None)
  2015-06-30 14:57:39+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Cp_72%3A2661618011> (referer: http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011)
  Crawl Precious Cat Ultra Premium Clumping Cat Litter, 40 pound bag
  Crawl http://www.amazon.com/Precious-Cat-Premium-Clumping-Litter/dp/B0009X29WK
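
To check whether the rule actually matches any pagination links on the pages that come back after those redirects, I thought about loading one of them in scrapy shell and running the link extractor by hand. Again, just a quick check I sketched, assuming the shell gets the same HTML the spider does:

# inside: scrapy shell "http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Cp_72%3A2661618011"
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

links = SgmlLinkExtractor(allow=r"page=\d+").extract_links(response)
print len(links)   # if this is 0, the rule has nothing to follow
for link in links:
    print link.url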

What am I doing wrong?

0 Answers:

No answers yet.