Question

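I'm not sure how I should structure my code so that the offset parameter is updated each time the function recursively calls itself. More details about my script and the challenge I'm trying to solve are here: Scraping Website With Infinite Scroll Using Scrapy. I feel like there is a simple solution that I'm missing.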

import scrapy
import json
import requests

class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en']
    start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_type=mi']

    def parse(self, response):
        data = json.loads(response.text)
        for used_item in data:
            if len(data) == 0:
                break
            try:
                title = used_item['name']
                price = used_item['price']
                description = used_item['description']
                date = used_item['updated_at']
                images = [img['url'] for img in used_item['images']]
                latitude = used_item['geo']['lat']
                longitude = used_item['geo']['lng']               
            except Exception:
                pass

        yield {'Title': title,
               'Price': price,
               'Description': description,
               'Date': date,
               'Images': images,
               'Latitude': latitude,
               'Longitude': longitude          
               }    

        i = 0
        for new_items_load in response:
            i += 50 
            offset = i
            new_request = 'https://search-products-pwa.letgo.com/api/products?country_code=US&offset=' + str(i) + \
                          '&quadkey=0320030123201&num_results=50&distance_type=mi'
            yield scrapy.Request(new_request, callback=self.parse)

Answer 1

Define the offset as a class attribute:

class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en']
    start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_type=mi']
    offset = 0  # <- here
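
To illustrate how a class attribute like this behaves when it is incremented through self, here is a minimal sketch. The Pager class and its next_offset method are hypothetical stand-ins for the spider and are not part of Scrapy. The first increment reads the class-level default and stores the result on the instance, so the running total lives on the object while the class default stays at 0.

```python
class Pager:
    offset = 0  # class attribute: the default every instance starts from

    def next_offset(self):
        # The first call reads the class value (0) and stores 50 on this
        # instance; later calls keep updating the instance's own value.
        self.offset += 50
        return self.offset


pager = Pager()
first = pager.next_offset()
second = pager.next_offset()
print(first, second, Pager.offset)  # prints: 50 100 0
```

Because a crawl normally runs on a single spider object, every call to parse sees the same running value.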

You can then refer to it as self.offset, and the value will be shared across all calls to the parse function. So it would look like this:

self.offset += 50
new_request = 'https://search-products-pwa.letgo.com/api/products?country_code=US&offset=' + str(self.offset) + \
                      '&quadkey=0320030123201&num_results=50&distance_type=mi'
yield scrapy.Request(new_request, callback=self.parse)
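
As written, the snippet above keeps requesting further pages with no stopping condition. As an alternative sketch that avoids keeping state on the spider, the next offset can be derived from the URL of the response that was just parsed, and paging can stop once the API returns an empty list. The helper name next_page_url and its step parameter are hypothetical. The assumption that an empty JSON list marks the last page is taken from the len(data) == 0 check in the question, not from any documented behaviour of the API.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url, items, step=50):
    """Return the URL of the next page, or None when `items` is empty."""
    if not items:
        return None  # assumed end-of-results signal: an empty page
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    # Read the offset of the page just parsed (0 if the parameter is absent).
    current_offset = int(query.get("offset", ["0"])[0])
    query["offset"] = [str(current_offset + step)]
    # Rebuild the URL, leaving every other query parameter unchanged.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))
```

Inside parse, this could be called as next_url = next_page_url(response.url, data), followed by a yield of scrapy.Request(next_url, callback=self.parse) only when next_url is not None.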
