How do I scrape the "next page" in the Scrapy tutorial?

Asked: 2020-07-27 16:55:39

Tags: python scrapy

I am working through the scrapy tutorial and am on the "Craigslist Scrapy Spider #3 – Multiple Pages" section, but after following the instructions given there I cannot get more than one page. The only difference between what I did and what the tutorial shows is that I scraped "all jobs" instead of just engineering jobs (because there is only one page of engineering jobs). My code is below:

import scrapy

from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = 'jobs-new'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/jjj']

def parse(self, response):
    

    jobs = response.xpath('//p[@class="result-info"]')
    for job in jobs:
        title = job.xpath('a/text()').extract_first()
        address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
        relative_url = job.xpath('a/@href').extract_first()
        absolute_url = response.urljoin(relative_url)

        yield{'URL':absolute_url, 'Title':title, 'Address':address}
    

    relative_next_url = response.xpat('//a[@class="button next"]/@href').extract_first()
    absolute_next_url = response.urljoin(relative_next_url)

    yield request(absolute_next_url, callback=self.parse)
    

I ran the spider from the terminal with

scrapy crawl jobs-new -o jobs-new.csv

but the .csv file only contains the first page of results.

What do I need to do to get more than one page? Is the tutorial wrong, or am I misunderstanding something?

1 Answer:

Answer 0: (score: 0)

I edited your code and it works now. Three things needed fixing: the `parse` method must be indented so it belongs to the class (in your version it sits at module level, so the spider never calls it as its parse callback), `response.xpat` is a typo for `response.xpath`, and the next-page request needs `scrapy.Request` with a capital R (you imported `Request` but then yielded the undefined name `request`).

import scrapy
from scrapy import Request

class JobsSpider(scrapy.Spider):
    name = 'jobs-new'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/jjj']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')
        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)

            yield {'URL': absolute_url, 'Title': title, 'Address': address}

        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = response.urljoin(relative_next_url)

        yield scrapy.Request(absolute_next_url, callback=self.parse)
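One edge case the code above does not guard against: on the last results page there is no "next" link, so `extract_first()` returns `None`. Here is a minimal sketch of the guard, using the standard library's `urljoin`; the `absolute_next` helper is my own illustration, not part of the tutorial:

```python
from urllib.parse import urljoin

# Hypothetical helper (not from the tutorial) mirroring the guard you
# would put around the final `yield` in parse(): only build an absolute
# next-page URL when the "next" link was actually found on the page.
def absolute_next(base_url, relative_next_url):
    if relative_next_url is None:
        # Last results page: no <a class="button next"> element exists.
        return None
    return urljoin(base_url, relative_next_url)

# A normal page: Craigslist paginates with an `s=` offset parameter.
print(absolute_next('https://newyork.craigslist.org/search/jjj',
                    '/search/jjj?s=120'))
# The last page: no next link, so no further request should be yielded.
print(absolute_next('https://newyork.craigslist.org/search/jjj', None))
```

Inside `parse()` the same idea is simply `if relative_next_url is not None:` before the final `yield`, so the spider stops cleanly instead of requesting a bogus URL once pagination runs out.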

Here is some of the output:

{'URL': 'https://newyork.craigslist.org/brk/trp/d/brooklyn-overnight-parking-attendant/7166876233.html', 'Title': 'Overnight Parking Attendant', 'Address': 'Brooklyn, NY'}

{'URL': 'https://newyork.craigslist.org/wch/fbh/d/yonkers-experience-grill-man/7166875818.html', 'Title': 'Experience grill man', 'Address': 'Yonkers'}