Why doesn't Scrapy move on to the next page, and why does it only get items from the first page?

Asked: 2014-04-06 10:34:54

Tags: python web-scraping scrapy

Earlier I also had a rule, namely:

if domains in departments.keys():
    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=("?tab_value=all&search_query=%s&search_constraint=%s&Find=Find&pref_store=1801&ss=false&ic=d_d" % (keyword, departments.get(domains)),),
                restrict_xpaths=('//li[@class="btn-nextResults"]'),
            ),
            callback='parse',
            follow=True,
        ),
    )

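But I removed it, because it used the parse method as its callback, which is not recommended for a CrawlSpider. My current spider is: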

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from walmart_sample.items import WalmartSampleItem


class MySpider(CrawlSpider):

    name = "my_spider"
    domains = ['All Departments']
    keyword = 'Laptop'
    departments = {"All Departments": "0", "Apparel": "5438", "Auto": "91083", "Baby": "5427", "Beauty": "1085666","Books": "3920", "Electronics": "3944", "Gifts": "1094765", "Grocery": "976759", "Health": "976760","Home": "4044", "Home Improvement": "1072864", "Jwelery": "3891", "Movies": "4096", "Music": "4104","Party": "2637", "Patio": "5428", "Pets": "5440", "Pharmacy": "5431", "Photo Center": "5426","Sports": "4125", "Toys": "4171", "Video Games": "2636"}
    allowed_domains = ['walmart.com']
    denied_domains = ['reviews.walmart.com','facebook.com','twitter.com']

    def start_requests(self):
        for domain in self.domains:
            if domain in self.departments:
                url = 'http://www.walmart.com/search/search-ng.do?search_query=%s&ic=16_0&Find=Find&search_constraint=%s' % (self.keyword, self.departments.get(domain))
                yield Request(url)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a[@class="prodLink ListItemLink"]/@href')
        for link in links:
            href = link.extract()
            yield Request('http://www.walmart.com/' + href, self.parse_data) 
        next_link = hxs.select('//li[@class="btn-nextResults"]/@href').extract()
        if next_link:
            yield Request('http://www.walmart.com/search/search-ng.do' + next_link, self.parse)
        else:
            print "last Page"                

    def parse_data(self, response):
        hxs = HtmlXPathSelector(response)
        items=[]
        walmart=WalmartSampleItem()
        walmart['Title']=hxs.select('//h1[@class="productTitle"]/text()').extract()
        walmart['Price']=hxs.select('//span[@class="bigPriceText1"]/text()').extract()+hxs.select('//span[@class="smallPriceText1"]/text()').extract()
        walmart['Availability']=hxs.select('//span[@id="STORE_AVAIL"]/text()').extract()
        walmart['Description']=hxs.select('//span[@class="ql-details-short-desc"]/text()').extract()
        items.append(walmart)
        return items

1 Answer:

Answer 0 (score: 0)

I think you are just missing the "/a" step in the XPath for the next-page link:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a[@class="prodLink ListItemLink"]/@href')
        for link in links:
            href = link.extract()
            yield Request('http://www.walmart.com/' + href, self.parse_data)
        #
        #                                                    here
        #                                                      |
        #                                                      v
        next_link = hxs.select('//li[@class="btn-nextResults"]/a/@href').extract()
        if next_link:
            # extract() returns a list of strings, so take the first element
            yield Request('http://www.walmart.com/search/search-ng.do' + next_link[0], self.parse)
        else:
            print "last Page"
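
To see why both changes matter, here is a minimal sketch that runs without Scrapy. It uses the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector, and the markup in SAMPLE is a hypothetical pager assumed to look like the one on the results page (in particular, the relative "?..." href is an assumption, not copied from the site). ElementTree cannot select attributes with "/@href", so the helper reads the attribute with .get() instead.

```python
import xml.etree.ElementTree as ET

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Hypothetical markup, assumed to resemble the "next results" button.
SAMPLE = (
    '<ul>'
    '<li class="btn-nextResults">'
    '<a href="?search_query=Laptop&amp;ic=16_16">Next</a>'
    '</li>'
    '</ul>'
)

BASE_URL = 'http://www.walmart.com/search/search-ng.do'


def hrefs(root, path):
    """Return the href of every element matched by path, skipping elements without one."""
    return [el.get('href') for el in root.findall(path) if el.get('href') is not None]


root = ET.fromstring(SAMPLE)

# The <li> itself carries no href attribute, so this list is empty
# and the "if next_link:" branch would never run.
on_li = hrefs(root, './/li[@class="btn-nextResults"]')

# The href lives on the <a> child, so the extra step finds it.
on_a = hrefs(root, './/li[@class="btn-nextResults"]/a')

# Concatenating a str with the whole list fails; only one element can be joined.
try:
    BASE_URL + on_a
    concat_error = None
except TypeError as exc:
    concat_error = exc

# urljoin resolves the relative href against the search URL.
next_url = urljoin(BASE_URL, on_a[0])

print(on_li)
print(on_a)
print(next_url)
```

For an href that is only a query string, urljoin gives the same result as the plain concatenation in the answer, and it also copes with hrefs that are absolute paths or full URLs.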