I am playing around with scraping information from Amazon, but it is giving me a hard time. So far my spider looks like this:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
# AmzItem is a plain Item class defined in my project's items.py


class AmzCrawlerSpider(CrawlSpider):
    name = 'amz_crawler'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s?ie=UTF8&bbn=1267449011&page=1&rh=n%3A284507%2Cn%3A1055398%2Cn%3A%211063498%2Cn%3A1267449011%2Cn%3A3204211011']

    # Follow pagination links and hand each result page to parse_item
    rules = (
        Rule(SgmlLinkExtractor(allow=r"page=\d+"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        category_name = Selector(response).xpath('//*[@id="nav-subnav"]/a[1]/text()').extract()[0]
        products = Selector(response).xpath('//div[@class="s-item-container"]')
        for product in products:
            item = AmzItem()
            item['title'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@title').extract()[0]
            item['url'] = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
            # Request the product detail page and pass the item along via meta
            request = scrapy.Request(item['url'], callback=self.parse_product)
            request.meta['item'] = item
            print "Crawl " + item["title"]
            print "Crawl " + item['url']
            yield request

    def parse_product(self, response):
        print "Parse Product"
        item = response.meta['item']
        sel = Selector(response)
        item['asin'] = sel.xpath('//td[@class="bucket"]/div/ul/li/b[contains(text(),"ASIN:")]/../text()').extract()[0]
        return item
There are two issues I cannot figure out: "Parse Product" is never printed, so I assume parse_product is never executed, even though the "Crawl ..." lines do get printed. Maybe it has something to do with the rules?
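To figure out where the detail-page requests go, the best idea I have so far is to add some extra logging in parse_item and attach an errback, roughly like this (just a sketch; on_product_error, dont_filter and the log calls are things I would add only for this debugging run, they are not in my real spider):

    # Debug variant of parse_item: log how many products are found and
    # whether the detail-page requests fail instead of silently disappearing.
    def parse_item(self, response):
        products = Selector(response).xpath('//div[@class="s-item-container"]')
        self.log("parse_item %s: %d products" % (response.url, len(products)))
        for product in products:
            url = product.xpath('.//a[contains(@class, "s-access-detail-page")]/@href').extract()[0]
            yield scrapy.Request(url,
                                 callback=self.parse_product,
                                 errback=self.on_product_error,  # fires on download errors
                                 dont_filter=True)               # rule out the duplicate filter for this test

    def on_product_error(self, failure):
        # If this gets logged, the request was scheduled but the download failed.
        self.log("Detail page request failed: %r" % failure)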
Then, related to the rules: it only works for the first page of a category. The crawler does not follow the links to the second page of a category. I assume the pages Scrapy gets are generated differently than in a browser? In the console I see a lot of 301 redirects (log excerpt below, and right after it the scrapy shell check I plan to try):
2015-06-30 14:57:24+0800 [amz_crawler] DEBUG: Redirecting (301) to <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> from <http://www.amazon.com/gp/search/ref=sr_pg_1?sf=qz&fst=as%3Aoff&rh=n%3A2619533011%2Ck%3Apet+supplies%2Cp_n_date_first_available_absolute%3A2661609011%2Cp_72%3A2661618011&sort=date-desc-rank&keywords=pet+supplies&ie=UTF8&qid=1435312739>
2015-06-30 14:57:29+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011> (referer: None)
2015-06-30 14:57:39+0800 [amz_crawler] DEBUG: Crawled (200) <http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Cp_72%3A2661618011> (referer: http://www.amazon.com/s?ie=UTF8&page=1&rh=n%3A2619533011%2Ck%3Apet%20supplies%2Cp_72%3A2661618011%2Cp_n_date_first_available_absolute%3A2661609011)
Crawl Precious Cat Ultra Premium Clumping Cat Litter, 40 pound bag
Crawl http://www.amazon.com/Precious-Cat-Premium-Clumping-Litter/dp/B0009X29WK
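And this is the quick check I mean above: running the same link extractor by hand in scrapy shell against the category page, to see whether it even finds a page=2 link (a rough sketch, assuming the old SgmlLinkExtractor import path from my Scrapy version):

    # Inside "scrapy shell <category URL>": run the same extractor the rule uses
    # and print the links it would follow. If no page=2 URL shows up here,
    # the problem is the markup/extractor rather than the callback.
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    for link in SgmlLinkExtractor(allow=r"page=\d+").extract_links(response):
        print link.url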
What am I doing wrong?