I have read Scrapy: Follow link to get additional Item data? and followed it, but it does not work. It is probably a simple mistake, so I am posting my spider's source code.
import scrapy
from scrapy.spider import Spider
from scrapy.selector import Selector, HtmlXPathSelector

class MySpider1(Spider):
    name = "timeanddate"
    allowed_domains = ["http://www.timeanddate.com"]
    start_urls = (
        'http://www.timeanddate.com/holidays/',
    )

    def parse(self, response):
        countries = Selector(response).xpath('//div[@class="fixed"]//li/a[contains(@href, "/holidays/")]')
        for item in countries:
            link = item.xpath('@href').extract()[0]
            country = item.xpath('text()').extract()[0]
            linkToFollow = self.allowed_domains[0] + link + "/#!hol=1"
            print link          # link
            print country       # text in an HTML tag
            print linkToFollow
            request = scrapy.Request(linkToFollow, callback=self.parse_page2)

    def parse_page2(self, response):
        print "XXXXXX"
        hxs = HtmlXPathSelector(response)
        print hxs
I am trying to get the list of all holidays for each country, which requires visiting another page.
I cannot understand why parse_page2 is never called.
Answer 0 (score: 1)
I was able to get your example working using Link Extractors. Here is an example:
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

class TimeAndDateSpider(CrawlSpider):
    name = "timeanddate"
    allowed_domains = ["timeanddate.com"]
    start_urls = [
        "http://www.timeanddate.com/holidays/",
    ]

    rules = (
        Rule(LxmlLinkExtractor(restrict_xpaths=('//div[@class="fixed"]//li/a[contains(@href, "/holidays/")]',)),
             callback='second_page'),
    )

    # callback for the 2nd page (each country's holidays page)
    def second_page(self, response):
        print "second page - %s" % response.url
I will keep trying to get the Request callback example working as well.
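For the record, the most likely reason parse_page2 is never called in the question's spider: parse() creates the Request object but never yields (or returns) it, so Scrapy's scheduler never sees it and the callback never fires. A minimal plain-Python sketch of that pitfall, with no Scrapy dependency (the tuple is just an illustrative stand-in for a Request):

```python
def parse_without_yield(links):
    # Mirrors the bug in the question's spider: a request is built on
    # each iteration but never handed back to the caller (in Scrapy,
    # the scheduler), so no callback can ever run.
    for link in links:
        request = ("GET", link)  # stand-in for scrapy.Request(...)
    # falls through and implicitly returns None

def parse_with_yield(links):
    # Yielding each request hands it back to the caller; this is the
    # mechanism Scrapy relies on to schedule requests and later invoke
    # their callbacks.
    for link in links:
        yield ("GET", link)

links = ["http://www.timeanddate.com/holidays/us/"]
print(parse_without_yield(links))    # None: nothing was scheduled
print(list(parse_with_yield(links)))  # the request reaches the caller
```

Applied to the original spider, changing `request = scrapy.Request(...)` to `yield scrapy.Request(linkToFollow, callback=self.parse_page2)` should make parse_page2 fire.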