Extracting data page by page with a single Scrapy spider

Date: 2016-09-16 07:19:54

Tags: python web scrapy scrapy-spider

I am trying to extract data from goodreads.

I want to crawl the pages one at a time, with a time delay between requests.

My spider looks like this:

import scrapy
import unidecode
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html


class ElementSpider(scrapy.Spider):
    name = 'books'
    download_delay = 3
    allowed_domains = ["https://www.goodreads.com"]
    start_urls = ["https://www.goodreads.com/list/show/19793.I_Marked_My_Calendar_For_This_Book_s_Release?page=1",
                   ]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="next_page"]',)), callback="parse", follow=True),)

    def parse(self, response):
        for href in response.xpath('//div[@id="all_votes"]/table[@class="tableList js-dataTooltip"]/tr/td[2]/div[@class="js-tooltipTrigger tooltipTrigger"]/a/@href'):
            full_url = response.urljoin(href.extract())
            print(full_url)
            yield scrapy.Request(full_url, callback=self.parse_books)

        next_page = response.xpath('.//a[@class="button next"]/@href').extract()
        if next_page:
            next_href = next_page[0]
            print(next_href)
            next_page_url = 'https://www.goodreads.com' + next_href
            request = scrapy.Request(url=next_page_url)
            yield request

    def parse_books(self, response):
        # scrape the individual book page
        yield {
            'url': response.url,
            'title': response.xpath('//div[@id="metacol"]/h1[@class="bookTitle"]/text()').extract(),
        }

Please suggest how to do this so that the data from all pages is extracted in a single run of the spider.
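I run the spider with Scrapy's JSON feed export, something like this (the exact output filename doesn't matter):

    scrapy crawl books -o books.json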

2 Answers:

Answer 0 (score: 0):

I made a change in my code and it works. The change is:

request = scrapy.Request(url=next_page_url)

should be

request = scrapy.Request(next_page_url, self.parse)

It works well when I comment out allowed_domains = ["https://www.goodreads.com"]; otherwise no data is saved to the JSON file. Can anyone explain why?
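Putting it together, here is a minimal sketch of the corrected tail of parse (I use response.urljoin instead of hand-prefixing the domain; the two are equivalent here):

        next_page = response.xpath('.//a[@class="button next"]/@href').extract()
        if next_page:
            # build an absolute URL from the (possibly relative) href
            next_page_url = response.urljoin(next_page[0])
            # route the response back through parse to keep paginating
            yield scrapy.Request(next_page_url, callback=self.parse)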

Answer 1 (score: 0):

It looks like allowed_domains needs a better explanation in the documentation, but if you look at the examples there, a domain should be structured like domain.com, so avoid the scheme and unnecessary subdomains (www is a subdomain).
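Concretely, for this spider the setting would look like the sketch below. Scrapy's OffsiteMiddleware matches request hostnames against the entries in allowed_domains; a value that contains a scheme never matches any hostname, so the follow-up requests get filtered out as offsite, which is why the JSON file stayed empty:

    class ElementSpider(scrapy.Spider):
        name = 'books'
        # domain only: no scheme ("https://") and no "www." prefix
        allowed_domains = ["goodreads.com"]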