Question

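I am trying to scrape an online retailer's website through its search feature. For example, when a request is sent to www.example.com/search, the page returns a paginated list of links to all of the site's products. The product pages are all more or less the same, so scraping each one is relatively simple. The problem is that roughly 40,000 product page links are returned. I want to request all of them for the initial database load, then schedule a scraper to run daily and add any new products. What would be a good way to scrape these 40,000 product pages efficiently with Scrapy? My current code is: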

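import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'  # Scrapy requires every spider to have a name
    next_page = 1
    last_page = 100
    # Scrapy needs absolute URLs with a scheme (https is assumed here)
    start_urls = ['https://www.example.com/search?page={}'.format(next_page)]

    def parse(self, response):
        # Process the response directly; re-requesting response.url would be
        # dropped by Scrapy's duplicate filter
        yield from self.follow_product_links(response)
        yield from self.follow_pagination_links(response)

    def follow_product_links(self, response):
        # 'a.product::attr(href)' is a placeholder for the real selector
        for href in response.css('a.product::attr(href)').getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        # Scrape the product page and yield an item
        # details are not relevant to my problem
        yield {'url': response.url}  # placeholder item

    def follow_pagination_links(self, response):
        self.next_page += 1
        if self.next_page <= self.last_page:  # include the last page
            url = 'https://www.example.com/search?page={}'.format(self.next_page)
            # The request must be yielded, otherwise it is never scheduled
            yield scrapy.Request(url, callback=self.parse)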

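This is a simplified sketch rather than complete code, but it should give you an idea of what I am trying to do. I know Scrapy is asynchronous, which should help, but is there a better technique I could use? As an aside, I wish the site exposed a public API for querying its database, but unfortunately it does not.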
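For illustration only: since the number of result pages is fixed ahead of time in the spider above (last_page = 100), the search page URLs could be generated up front instead of being chained one after another. The following is a minimal sketch of that idea, not a tested solution for this site. The https scheme, the SEARCH_URL constant and the build_search_page_urls helper are assumptions introduced for this example.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint and its `page` query parameter follow the
# pattern used in the spider above; the https scheme is assumed.
SEARCH_URL = "https://www.example.com/search"


def build_search_page_urls(last_page, first_page=1):
    """Return one search URL per results page, first_page..last_page inclusive."""
    return [
        f"{SEARCH_URL}?{urlencode({'page': page})}"
        for page in range(first_page, last_page + 1)
    ]


if __name__ == "__main__":
    urls = build_search_page_urls(last_page=100)
    print(len(urls))   # number of search pages to request
    print(urls[0])     # first search page
    print(urls[-1])    # last search page
```

If this list were assigned to the spider's start_urls, Scrapy could schedule all of the search pages from the start rather than discovering them one page at a time. Whether that is actually faster here depends on the site and on the spider's concurrency settings.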

How to efficiently scrape page results from a paginated list of links returned by a website's search function

0 answers: