Scrapy - extracting information from a list of links

Date: 2021-05-31 23:03:45

Tags: python-3.x web-scraping scrapy

I'm writing a scraper in Python with Scrapy. I have a page that lists products as my start_urls; the scraper follows the link to each product and scrapes that product's information (which I store in the fields of the item class defined in items.py). Each of these products can also have a list of variations, and I need to scrape the information from every variation, collect it in a list, and store that list in item['variations'].

    def parse(self, response):
        links = response.css(css_links).getall()
        links = [self.process_url(link) for link in links]
        for link in links:
            link = urljoin(response.url, link)
            yield scrapy.Request(link, callback=self.parse_product)

    def parse_product(self, response):
        items = SellItem()
        shipper = self.get_shipper(response)
        items['shipper'] = shipper
        items['weight'] = self.get_weight(response)
        items['url'] = response.url
        items['category'] = self.get_category(response)
        items['cod'] = response.css(css_cod).get()
        items['price'] = self.get_price(response)
        items['cantidad'] = response.css(css_cantidad).get()
        items['name'] = response.css(css_name).get()
        items['images'] = self.get_images(response)
        variations = self.get_variations(response)
        if variations:
            valid_urls = self.get_valid_urls(variations)
            for link in valid_urls:
                # I need to go to each of these URLs, scrape the information,
                # and then store it in items['variations'].

1 Answer:

Answer 0: (score: 0)

You need to add a second callback method; call it parse_details, for example.

Then, when you yield the Request for each variation URL from parse_product, pass callback=self.parse_details.

You can use response.meta to carry the data you have already collected from one callback to the next.

Scrapy covers this in its documentation:

See: https://docs.scrapy.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
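
A minimal sketch of that meta-based hand-off, assuming it lives inside the same spider class as parse_product above; get_variation_data is a hypothetical helper that extracts one variation's fields, and the URLs returned by get_valid_urls are assumed to be absolute:

    def parse_product(self, response):
        items = SellItem()
        # ... fill in the scalar fields exactly as in the question ...
        variations = self.get_variations(response)
        if variations:
            valid_urls = self.get_valid_urls(variations)
            items['variations'] = []
            # Request the first variation page and carry both the
            # partially-filled item and the remaining URLs in meta.
            yield scrapy.Request(
                valid_urls[0],
                callback=self.parse_details,
                meta={'item': items, 'pending': valid_urls[1:]},
            )
        else:
            yield items

    def parse_details(self, response):
        item = response.meta['item']
        pending = response.meta['pending']
        # get_variation_data is a hypothetical helper that pulls this
        # variation's fields out of the current response.
        item['variations'].append(self.get_variation_data(response))
        if pending:
            # Chain to the next variation, passing the same item along.
            yield scrapy.Request(
                pending[0],
                callback=self.parse_details,
                meta={'item': item, 'pending': pending[1:]},
            )
        else:
            # All variations collected; the item is now complete.
            yield item

Chaining the requests this way keeps a single item per product and only yields it once the last variation page has been parsed.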

Also read about Request.cb_kwargs, the newer mechanism for passing extra data to callback functions.
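
With Request.cb_kwargs (available since Scrapy 1.7) the same hand-off can be written so the extra data arrives as keyword arguments of the callback instead of being read from response.meta; a sketch using the same assumed names as above:

        # In parse_product, instead of meta:
        yield scrapy.Request(
            valid_urls[0],
            callback=self.parse_details,
            cb_kwargs={'item': items, 'pending': valid_urls[1:]},
        )

    def parse_details(self, response, item, pending):
        # The values from cb_kwargs arrive as keyword arguments.
        item['variations'].append(self.get_variation_data(response))
        if pending:
            yield scrapy.Request(
                pending[0],
                callback=self.parse_details,
                cb_kwargs={'item': item, 'pending': pending[1:]},
            )
        else:
            yield item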