用scrapy抓取多个页面

时间:2014-05-27 19:46:04

标签: python web-scraping scrapy scrapy-spider

我正在尝试使用scrapy来抓取一个包含多页信息的网站。

我的代码是:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

我试图抓取所有页面,直到它到达页面的末尾...有时会有比其他页面更多的页面,因此很难准确说出页码的结束位置。

1 个答案:

答案 0 :(得分:9)

我们的想法是增加pageNumber,直到找不到titles为止。如果页面上没有titles - 抛出CloseSpider异常来阻止蜘蛛:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.exceptions import CloseSpider
from scrapy.http import Request
from tcgplayer1.items import Tcgplayer1Item


URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"

class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["tcgplayer.com"]
    start_urls = [URL % 1]

    def __init__(self):
        self.page_number = 1

    def parse(self, response):
        print self.page_number
        print "----------"

        sel = Selector(response)
        titles = sel.xpath("//div[@class='magicCard']")
        if not titles:
            raise CloseSpider('No more pages')

        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

        self.page_number += 1
        yield Request(URL % self.page_number)

这个特殊的蜘蛛会抛出所有8页数据,然后停止。

希望有所帮助。