Question

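Introduction

Since I have to crawl more deeply, I'm facing the next problem: crawling nested pages such as https://www.karton.eu/Faltkartons

My crawler has to start at this page, go to https://www.karton.eu/Einwellige-Kartonagen and visit every product listed in that category.

It should do this for every subcategory of "Faltkartons", for every product contained in each category.

Edited

My code now looks like this: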

import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allow_domains = ['karton.eu']
    start_urls = [
        'https://www.karton.eu/Faltkartons'
        ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume'] } 
    
    def parse(self, response):
        url = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        url2 = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center articelbox"]')

        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items ['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items ['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items ['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items ['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items ['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(),callback=self.parse_item, meta={'items':items})

    def parse_item(self,response):
        table = response.xpath('//div[@class="product-info-inner"]')

        items = KartonageItem()
        items = response.meta['items']
        items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items

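In my mind, it starts at the start_url, then visits https://www.karton.eu/Einwellige-Kartonagen, looks for links and follows them to https://www.karton.eu/einwellig-ab-100-mm. On that page it checks the cards for some information and follows the link to the specific product page to get the remaining item fields.

Which part of my approach is wrong? Should I change my class from "scrapy.Spider" to "CrawlSpider"? Or is that only needed if I want to set up some rules?

My XPaths for title, SKU, etc. may still be wrong, but first I just want to build the basics for crawling these nested pages.

My console output:

In the end I managed to get through all these pages, but somehow my .csv file is still empty.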

Answer 1

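Based on the comments you provided, the problem starts with you skipping a request in the chain.

Your start_urls will request this page: https://www.karton.eu/Faltkartons. That page is parsed by the parse method, which yields new requests for the pages from https://www.karton.eu/Karton-weiss to https://www.karton.eu/Einwellige-Kartonagen.

Those pages will be parsed by the parse_item method, but they are not the final pages you want. You need to parse the category cards on them and yield new requests, like this: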

for url in response.xpath('//div[@class="cat-thumbnails"]/div/a/@href'):
    yield scrapy.Request(response.urljoin(url.get()), callback=self.new_parsing_method)

As an example, when parsing https://www.karton.eu/Zweiwellige-Kartons this will find 9 new links.
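As a side note on `response.urljoin`: it resolves each `href` against the URL of the page being parsed, much like the standard library's `urllib.parse.urljoin`. The sketch below shows only that resolution step. The `hrefs` values are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

# URL of the category page being parsed (taken from the answer above).
page_url = "https://www.karton.eu/Zweiwellige-Kartons"

# Hypothetical href values as they might appear in the category cards.
hrefs = [
    "/zweiwellig-ab-100-mm",                 # site-relative
    "zweiwellig-ab-200-mm",                  # relative to the current page
    "https://www.karton.eu/Lange-Kartons",   # already absolute
]

# Resolve every href against the page URL, as response.urljoin would.
absolute = [urljoin(page_url, href) for href in hrefs]

for url in absolute:
    print(url)
```

`response.follow` does this resolution for you, which is why the spider code in this thread can pass the raw `href` to it.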

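Finally, you need a parsing method that scrapes the items on those pages. Since each page contains more than one item, I suggest you process them in a for loop. (You need the correct XPaths to scrape the data.)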
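The per-card loop can be sketched without a running spider. To keep the example self-contained, the snippet below swaps in the standard library's `xml.etree.ElementTree` (which supports a small XPath subset) for Scrapy's selectors, and the `HTML` string is a hypothetical, simplified stand-in for an item list page, not the real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified item list; the real page markup will differ.
HTML = """
<div>
  <div class="text-center artikelbox">
    <h5 class="title"><a href="/product-a">Product A</a></h5>
  </div>
  <div class="text-center artikelbox">
    <h5 class="title"><a href="/product-b">Product B</a></h5>
  </div>
</div>
"""

root = ET.fromstring(HTML)

rows = []
# One iteration per product card; all lookups are relative to the card.
for card in root.findall(".//div[@class='text-center artikelbox']"):
    link = card.find(".//h5[@class='title']/a")
    rows.append({"Title": link.text, "Link": link.get("href")})

for row in rows:
    print(row)
```

In the spider, the same shape applies: select the cards once, then run relative XPaths (starting with `.//`) on each card inside the loop.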

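Edit:

Now that the question has been edited again, I looked at the page structure and found that my code was based on a wrong assumption. The fact is that some pages have subcategory pages and others do not.

Page structure: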

ROOT: www.karton.eu/Faltkartons
 |_ Einwellige Kartons
    |_ Subcategory: Kartons ab 100 mm Länge
      |_ Item List (www.karton.eu/einwellig-ab-100-mm)
        |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
    ...
    |_ Subcategory: Kartons ab 1000 mm Länge
      |_ ...
 |_ Zweiwellige Kartons #Same as above
 |_ Lange Kartons #Same as above
 |_ quadratische Kartons #There is no subcategory
    |_ Item List (www.karton.eu/quadratische-Kartons)
      |_ Item Detail (www.karton.eu/113x113x100-mm-einwellige-Kartons)
 |_ Kartons Höhenvariabel #There is no subcategory
 |_ Kartons weiß #There is no subcategory

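The code below scrapes items from the pages that have subcategories, since I think that is what you want. Either way, I left a print statement in to show you the pages that are skipped because they have no subcategory page, in case you want to include them later.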

import scrapy
from ..items import KartonageItem

class KartonSpider(scrapy.Spider):
    name = "kartons12"
    allowed_domains = ['karton.eu']  # Scrapy reads "allowed_domains"; "allow_domains" is silently ignored
    start_urls = [
        'https://www.karton.eu/Faltkartons'
        ]
    custom_settings = {'FEED_EXPORT_FIELDS': ['SKU', 'Title', 'Link', 'Price', 'Delivery_Status', 'Weight', 'QTY', 'Volume'] } 
    
    def parse(self, response):
        url = response.xpath('//div[@class="cat-thumbnails"]')

        for a in url:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_category_cartons)

    def parse_category_cartons(self, response):
        url2 = response.xpath('//div[@class="cat-thumbnails"]')

        if not url2:
            print('Empty url2:', response.url)

        for a in url2:
            link = a.xpath('a/@href')
            yield response.follow(url=link.get(), callback=self.parse_target_page)

    def parse_target_page(self, response):
        card = response.xpath('//div[@class="text-center artikelbox"]')

        for a in card:
            items = KartonageItem()
            link = a.xpath('a/@href')
            items ['SKU'] = a.xpath('.//div[@class="delivery-status"]/small/text()').get()
            items ['Title'] = a.xpath('.//h5[@class="title"]/a/text()').get()
            items ['Link'] = a.xpath('.//h5[@class="text-center artikelbox"]/a/@href').extract()
            items ['Price'] = a.xpath('.//strong[@class="price-ger price text-nowrap"]/span/text()').get()
            items ['Delivery_Status'] = a.xpath('.//div[@class="signal_image status-2"]/small/text()').get()
            yield response.follow(url=link.get(),callback=self.parse_item, meta={'items':items})

    def parse_item(self,response):
        table = response.xpath('//div[@class="product-info-inner"]')

        # items = KartonageItem()  # Not needed: the line below overwrites the variable.
        items = response.meta['items']
        items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
        items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()
        yield items
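If the pages without subcategories should be included later, one option is to fall back to item parsing whenever a page has no category cards. The sketch below illustrates only that branching on a hand-written, in-memory site map. The `SITE` dict and the `crawl` function are hypothetical stand-ins, not Scrapy API, and the paths are just the ones mentioned in the page structure above.

```python
# Hypothetical in-memory model of the site: a page lists either
# subcategory links or item links. This is not real crawl data.
SITE = {
    "/Faltkartons": {"subcategories": ["/Einwellige-Kartonagen", "/quadratische-Kartons"]},
    "/Einwellige-Kartonagen": {"subcategories": ["/einwellig-ab-100-mm"]},
    "/einwellig-ab-100-mm": {"items": ["/113x113x100-mm-einwellige-Kartons"]},
    "/quadratische-Kartons": {"items": ["/113x113x100-mm-einwellige-Kartons"]},
}


def crawl(path):
    """Yield (list_page, item_page) pairs reachable from path."""
    page = SITE[path]
    subcategories = page.get("subcategories", [])
    if subcategories:
        # The page has category cards: descend one level further.
        for sub in subcategories:
            yield from crawl(sub)
    else:
        # No category cards: treat the page itself as an item list.
        for item in page.get("items", []):
            yield (path, item)


found = list(crawl("/Faltkartons"))
for list_page, item_page in found:
    print(list_page, "->", item_page)
```

In the spider, a comparable change might be to hand the response to the item list parser from `parse_category_cartons` when `url2` is empty, instead of only printing the URL.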

Notes

Changed this:

    card = response.xpath('//div[@class="text-center articelbox"]')

to this (with a K instead of a C):

    card = response.xpath('//div[@class="text-center artikelbox"]')

Commented this out, because the item passed in meta is already a KartonageItem. (You can delete the line.)

def parse_item(self,response):
    table = response.xpath('//div[@class="product-info-inner"]')
    #items = KartonageItem()
    items = response.meta['items']

In the parse_item method, changed this:

items['Weight'] = a.xpath('.//span[@class="staffelpreise-small"]/text()').get()
items['Volume'] = a.xpath('.//td[@class="icon_contenct"][7]/text()').get()

to this:

items['Weight'] = response.xpath('.//span[@class="staffelpreise-small"]/text()').get()
items['Volume'] = response.xpath('.//td[@class="icon_contenct"][7]/text()').get()

because a does not exist in that method.
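To see why this matters: `a` is the loop variable of `parse_target_page` and is local to it, so using it in another method raises a `NameError` before `yield items` is reached. Scrapy logs exceptions raised in callbacks and that callback then yields nothing, which would be consistent with the empty .csv reported in the question. The functions below are simplified, hypothetical stand-ins for the spider methods.

```python
def parse_target_page():
    # "a" only exists inside this function.
    for a in ["card-1", "card-2"]:
        pass


def parse_item():
    # "a" was never defined here, so this line raises NameError.
    return a.upper()


parse_target_page()

caught = False
try:
    parse_item()
except NameError as exc:
    caught = True
    print("NameError:", exc)
```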
