Getting information from a scraped link

Time: 2018-03-27 17:04:25

Tags: python-3.x scrapy scrapy-spider

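I first try to get each book's link, then follow that link and get the book's title. Finally, I want to store the title in one column of a CSV file and the link in another column. This is what I have written so far. I only get the links, not the titles.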

import scrapy


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    allowed_domains = ['www.amazon.com']
    start_urls = ['https://www.amazon.com/s/ref=dp_bc_3?ie=UTF8&node=468216&rh=n%3A283155%2Cn%3A%212349030011%2Cn%3A465600%2C']

    def parse(self, response):
        links = response.xpath('//*[@class="a-link-normal s-access-detail-page  s-color-twister-title-link a-text-normal"]/@href').extract()

        for link in links:
            yield {'Book Urls': link}
            yield scrapy.Request(link, callback=self.book_title)

    def book_title(self, response):
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        yield {'Title': title}

1 answer:

Answer 0 (score: 0)

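I solved this with response.meta: the link travels with the request to the callback, so the title and the link can be yielded together as one item.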

import scrapy
class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    allowed_domains = ['www.amazon.com']
    start_urls = ['https://www.amazon.com/s/ref=dp_bc_3?ie=UTF8&node=468216&rh=n%3A283155%2Cn%3A%212349030011%2Cn%3A465600%2C']

    def parse(self, response):
        links = response.xpath('//*[@class="a-link-normal s-access-detail-page  s-color-twister-title-link a-text-normal"]/@href').extract()
        for link in links:
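            # Carry the link to the callback through meta so that one item can hold both fields.
            # (No title exists yet at this point, so only the link is passed along.)
            yield scrapy.Request(link, callback=self.book_title, meta={'Link': link})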


    def book_title(self, response):
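        # extract_first() returns a single string (or None) rather than a list.
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
        # Yield only the two wanted fields; response.meta also holds Scrapy-internal keys such as 'depth'.
        yield {'title': title, 'Link': response.meta['Link']}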
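
Once every item carries both the title and the link, the two-column CSV from the question follows directly. The snippet below is only a minimal sketch using the standard library, not part of the spider. The helper names `to_row` and `write_csv` are hypothetical, and the sample items are placeholders rather than scraped data. It assumes items keyed by 'title' and 'Link' as in the meta dict above, and it tolerates a title that is a list (from `.extract()`) as well as extra keys such as 'depth'.

```python
import csv
import io

# Column order for the CSV; the key names follow the meta dict used in the answer.
FIELDS = ["title", "Link"]


def to_row(item):
    """Keep only the two wanted fields and flatten the title to a clean string."""
    title = item.get("title")
    # .extract() gives a list of strings, .extract_first() a single string or None.
    if isinstance(title, list):
        title = " ".join(title)
    title = (title or "").strip()
    return {"title": title, "Link": item.get("Link", "")}


def write_csv(items, fileobj):
    """Write a header line followed by one row per book."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        writer.writerow(to_row(item))


# Placeholder items for illustration only (NOT real scraped data).
sample_items = [
    {
        "title": ["\n  Example Book One \n"],
        "Link": "https://www.amazon.com/dp/EXAMPLE1",
        "depth": 1,                 # extra key of the kind found in response.meta
        "download_timeout": 180.0,  # extra key of the kind found in response.meta
    },
    {"title": "Example Book Two", "Link": "https://www.amazon.com/dp/EXAMPLE2"},
]

buffer = io.StringIO()
write_csv(sample_items, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In a real crawl this step is usually unnecessary: running `scrapy crawl amazon_spider -o books.csv` lets Scrapy's feed export write the yielded items to a CSV file, which is one more reason to yield only the fields that should become columns.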