How can I make Scrapy Rules follow the next page?

Asked: 2014-01-13 16:21:32

Tags: python web-scraping web-crawler scrapy

I have set up a Rule to follow the next page from start_urls, but it does not work: it only crawls the start_urls page and the links inside that page (with parseLinks). It never goes to the next page set in the Rule.

Any help?

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy import log
from urlparse import urlparse
from urlparse import urljoin
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]

    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)

    def parse(self, response):
        sel = Selector(response)
        urls = sel.xpath('//div[@id="btReserve"]/../@href').extract()
        for url in urls:
            url = urljoin(response.url, url)
            self.log('URLS: %s' % url)
            yield Request(url, callback=self.parseLinks)

    def parseLinks(self, response):
        sel = Selector(response)
        titulo = sel.xpath('h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        email = sel.xpath('//a[@class="sendMail"][1]/text()')[0].extract()
        url = sel.xpath('//div[@class="contentContacto sendUrl"]/a/text()').extract()
        telefone = sel.xpath('//div[@class="telefone"]/div[@class="contentContacto"]/text()').extract()
        fax = sel.xpath('//div[@class="fax"]/div[@class="contentContacto"]/text()').extract()
        descricao = sel.xpath('//div[@id="tbDescricao"]/p/text()').extract()
        gps = sel.xpath('//td[@class="sendGps"]/@style').extract()

        print titulo, email, morada

2 Answers:

Answer 0 (score: 4)

You should not override the parse method of a CrawlSpider, otherwise the Rules will not be followed.

See the warning at http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules:

When writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
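
One way to follow that advice is to drop the parse override and let Rules drive everything. The sketch below is only an illustration of that idea, not part of the answer: the second Rule's restrict_xpaths is guessed from the question's XPath, and both Rules point restrict_xpaths at the <a> elements rather than their @href attributes.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector


class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/pesquisa/filtro/?tipo=0&local=0']

    rules = (
        # Follow the "next page" link; no callback, just keep crawling.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]',)),
             follow=True),
        # Hand each detail page (the <a> wrapping the btReserve div,
        # guessed from the question's XPath) to parseLinks.
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="btReserve"]/..',)),
             callback='parseLinks', follow=False),
    )

    def parseLinks(self, response):
        sel = Selector(response)
        titulo = sel.xpath('//h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        print titulo, morada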

Answer 1 (score: 1)

You are following the Spider class flow. class MySpider(CrawlSpider) is not the proper class here; use class MySpider(Spider) instead:

class MySpider(Spider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]

In the Spider class you do not need rules, so discard this line (it is not usable in a Spider):

rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)