Why can't I follow the next-page link with Scrapy?

Time: 2015-10-28 15:09:27

Tags: python web-scraping scrapy

OK, I know why it fails: nothing is being extracted for the next_page variable. But I'm not sure whether I'm using XPath correctly.

import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class SunBizSpider(scrapy.Spider):
    name = 'sunbiz'
    start_urls = ['http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults?inquiryType=EntityName&searchNameOrder=A&searchTerm=a']

    def parse(self, response):
        for href in response.css('.large-width a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        re1='((?:[0]?[1-9]|[1][012])[-:\\/.](?:(?:[0-2]?\\d{1})|(?:[3][01]{1}))[-:\\/.](?:(?:[1]{1}\\d{1}\\d{1}\\d{1})|(?:[2]{1}\\d{3})))(?![\\d])' # MMDDYYYY 1
        hxs = HtmlXPathSelector(response)
        date = response.xpath('//span').re_first(re1)
        next_page = hxs.select("//div[@class='navigationBar']/@href").extract()
        yield {
            'Name': response.css('.corporationName span::text').extract()[1],
            'Date': date,
            'Link': response.url,
        }
        if next_page:
            yield scrapy.Request(next_page[1], callback=self.parse_question)

1 answer:

Answer 0 (score: 1)

First, you don't need HtmlXPathSelector if you are already using response as the selector. response can handle both css and xpath, so don't worry about it.

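Second, you are trying to get the link with the XPath "//div[@class='navigationBar']/@href", which reads the href attribute from the div. You will agree that this is incorrect: href attributes belong to <a> tags, so the XPath you should use in this case is: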

"//div[@class='navigationBar'][1]//a[@title='Next On List']/@href"
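
To see the difference without running the spider, here is a minimal sketch that uses only the standard library. It makes several assumptions: SAMPLE_PAGE is a made-up, well-formed stand-in for the real detail page, PAGE_URL and the href values are hypothetical, and xml.etree.ElementTree stands in for Scrapy's selector. ElementTree supports only a subset of XPath (no /@href step), so the attribute is read with .get(), and find() takes the place of the [1] position predicate by returning the first match.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical page URL and markup; the real site's HTML may differ.
PAGE_URL = "http://example.com/detail/1"
SAMPLE_PAGE = """
<html><body>
  <div class="navigationBar">
    <a title="Previous On List" href="/hypothetical/previous-record">Previous On List</a>
    <a title="Next On List" href="/hypothetical/next-record">Next On List</a>
  </div>
  <div class="corporationName"><span>Example Corp</span></div>
  <div class="navigationBar">
    <a title="Next On List" href="/hypothetical/next-record">Next On List</a>
  </div>
</body></html>
"""


def find_next_href(root):
    """Return the href of the first 'Next On List' link inside a navigationBar div."""
    # find() returns the first match in document order, which plays the
    # role of the [1] position predicate in the XPath from the answer.
    link = root.find(".//div[@class='navigationBar']//a[@title='Next On List']")
    return None if link is None else link.get("href")


root = ET.fromstring(SAMPLE_PAGE)

# What the question's XPath asked for: an href on the div itself.
nav_div = root.find(".//div[@class='navigationBar']")
div_href = nav_div.get("href")  # the div carries no href attribute

# What the answer's XPath asks for: the href on the <a> inside the div.
next_href = find_next_href(root)

# The href is relative here, so join it with the page URL before requesting it
# (in a spider, response.urljoin does the same job).
next_url = urljoin(PAGE_URL, next_href)

print(div_href, next_href, next_url)
```

In the spider itself, the equivalent would presumably be to call response.xpath(...) with the XPath above, take the first extracted value, and pass it through response.urljoin before building the scrapy.Request, as parse already does for the result links.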