I have read Scrapy: Follow link to get additional Item data? and followed it, but it does not work. It is probably a simple mistake, so I am posting my spider's source code.
import scrapy
from scrapy.spider import Spider
from scrapy.selector import Selector, HtmlXPathSelector

class MySpider1(Spider):
    name = "timeanddate"
    allowed_domains = ["http://www.timeanddate.com"]
    start_urls = (
        'http://www.timeanddate.com/holidays/',
    )

    def parse(self, response):
        countries = Selector(response).xpath('//div[@class="fixed"]//li/a[contains(@href, "/holidays/")]')
        for item in countries:
            link = item.xpath('@href').extract()[0]
            country = item.xpath('text()').extract()[0]
            linkToFollow = self.allowed_domains[0] + link + "/#!hol=1"
            print link          # link
            print country       # text in an HTML tag
            print linkToFollow
            request = scrapy.Request(linkToFollow, callback=self.parse_page2)

    def parse_page2(self, response):
        print "XXXXXX"
        hxs = HtmlXPathSelector(response)
        print hxs
I am trying to get the list of all holidays for each country, which requires visiting another page.
I cannot understand why parse_page2 is never called.
Answer 0 (score: 1)
I was able to get your example working using Link Extractors. Here is an example:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

class TimeAndDateSpider(CrawlSpider):
    name = "timeanddate"
    allowed_domains = ["timeanddate.com"]
    start_urls = [
        "http://www.timeanddate.com/holidays/",
    ]

    rules = (
        Rule(LxmlLinkExtractor(restrict_xpaths=('//div[@class="fixed"]//li/a[contains(@href, "/holidays/")]',)),
             callback='second_page'),
    )

    # callback for the 2nd page (each country's holidays page)
    def second_page(self, response):
        print "second page - %s" % response.url
I will keep trying to get the Request callback example working as well.
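For the record, the most likely reason parse_page2 is never called in the question's spider: parse() creates the Request object but never yields (or returns) it, so Scrapy's scheduler never sees it and the callback never fires. A minimal plain-Python sketch of that pitfall, with no Scrapy dependency (the tuple is just an illustrative stand-in for a Request):

```python
def parse_without_yield(links):
    # Mirrors the bug in the question's spider: a request is built on
    # each iteration but never handed back to the caller (in Scrapy,
    # the scheduler), so no callback can ever run.
    for link in links:
        request = ("GET", link)  # stand-in for scrapy.Request(...)
    # falls through and implicitly returns None

def parse_with_yield(links):
    # Yielding each request hands it back to the caller; this is the
    # mechanism Scrapy relies on to schedule requests and later invoke
    # their callbacks.
    for link in links:
        yield ("GET", link)

links = ["http://www.timeanddate.com/holidays/us/"]
print(parse_without_yield(links))    # None: nothing was scheduled
print(list(parse_with_yield(links)))  # the request reaches the caller
```

Applied to the original spider, changing `request = scrapy.Request(...)` to `yield scrapy.Request(linkToFollow, callback=self.parse_page2)` should make parse_page2 fire.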