Question

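I want to scrape data from a website. To reach the data I have to follow a link from the home page, extract the data, go back to the home page, and then repeat that cycle (follow a link, extract the data, go back) for each link.

I know how to follow a link and extract the data, but I don't know how to visit the other links and return to where I was after visiting the first one.

Here is my current code: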

# -*- coding: utf-8 -*-
import scrapy


class SsFamilleSpider(scrapy.Spider):
    name = 'ss_famille'
    allowed_domains = ['rexel.fr']
    start_urls = ['https://www.rexel.fr/frx/browse/category']

    def parse(self, response):
        ssfamille = response.xpath("//div[@class='MML2 subDropDownMenu default browse-products-menu categoryList-container']//li//a/@href").get()
        yield {'ssfamille': ssfamille}
        test = response.xpath("//div[@id='facet_category']//div[@class='allFacetValues']//li//label[@class=' facet_leftCheckBox-label']//span/text()").extract()
        yield {'test': test}
        next_page = response.xpath("//div[@class='MML2 subDropDownMenu default browse-products-menu categoryList-container']//li//a/@href").get()
        if next_page is not None:
            yield response.follow(next_page, callback = self.parse)

Answer 1

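You don't need to move back and forth between pages to follow every link on the home page. Instead, first select all of the home page links and yield a request for each one. When selecting multiple links, use getall() to retrieve every match; get() returns only the first. Then loop over the results: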

next_pages = response.xpath("//div[@class='MML2 subDropDownMenu default browse-products-menu categoryList-container']//li//a/@href").getall()
for next_page in next_pages:
    yield response.follow(next_page, callback=self.parse)
