Question

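So, let's start with: "I'm a newbie". I've been testing Scrapy for a few days now, and I'm trying to do what I thought would be a simple process, but I just can't figure it out... My idea is: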

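Go to a page with paginated content
Get the URLs of all the individual items on each paginated page
Go into each individual URL and scrape some goodies :)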

So far, I've managed everything up to collecting all the individual URLs, which looks like this:

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'vinos'
    start_urls = [
        'https://www.somesite.com/es'
    ]

    def parse(self, response):
        # One dict per product on the current listing page
        for vino in response.css('div.product-container'):
            yield {
                'url': vino.css("a.product-name::attr(href)").get()
            }
        next_pageVar = response.css(".pagination_next a::attr(href)").get()

        if next_pageVar is not None:
            # urljoin resolves both relative and absolute hrefs against the current page URL
            next_page = response.urljoin(next_pageVar)
            yield scrapy.Request(next_page, callback=self.parse)

So if I run this script, I get a nice csv with all the individual URLs:

scrapy crawl vinos -o urls.csv

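The problem is that I can't figure out how to go into each individual item and get the data I need, which is what I actually want to export to csv (a csv delimited by "<", that is, since "," would get in the way of some fields).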

Once I'm on an individual page, this is what I want to get:

from some.items import SomeItem
item = SomeItem()
item["title"] =  response.css("h1::text").get()
item["recommendation"] = response.css(".recommendation::text").get()
temperature = response.css(".temperature::text").get()
if temperature is not None:
    item["temperature"] = temperature

As mentioned above, I'd like to be able to export all the items collected from each individual page to a csv file with a "<" delimiter.

Any ideas on how to do these two things?

Thanks a lot!

Get items from a paginated list and then scrape details with Scrapy

0 answers: