Question

I run my scraper (see below), but it only crawls the page given in 'start_urls'. From experimenting, I found that the 'restrict_xpaths' parameter does not seem to work.

# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule
from ..items import Category
from scrapy import Selector
from scrapy.linkextractors import LinkExtractor


class NeoSpider(CrawlSpider):
    name = 'neo'
    allowed_domains = ['neopoliscasa.ru']
    start_urls = ['http://www.neopoliscasa.ru/catalog.html']
    identifier = 1
    subcategory_parent_id = None
    type_parent_id = None
    categories = []
    rules = (
        Rule(
            LinkExtractor(
                allow='/catalog/[a-z-]+.html',
                restrict_xpaths='//div[contains(@class, "itemTypeIcoon n")]'),
            callback='parse_subcategories'),
    )

    def parse(self, response):
        sel = Selector(response)
        category_blocks = sel.xpath(
            '//div[@class="rootCatalogItem"]')
        for item in category_blocks:
            category = Category()
            category['category'] = ''.join(item.xpath(
                'h2/a/text()').extract())
            category['id'] = unicode(self.identifier)
            category['parent_id'] = unicode(0)
            self.subcategory_parent_id = self.identifier
            self.identifier += 1
            self.categories.append(category)
            yield category

    def parse_subcategories(self, response):
        #  do anything
        pass

How can I fix this? Thanks.

Answer 1

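The problem is that you override parse. With CrawlSpider you should not do that, because CrawlSpider uses the parse method itself to apply its rules, as stated in the docs.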

To fix it, rename parse to something else, such as parse_, or to parse_start_url if you want to scrape data from the start page itself.

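Then use a different class in the restriction, because there is no itemTypeIcoon element on the site. If the XPath matches nothing, no links are extracted and you will not get any results.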

Perhaps itemArt would be a good choice.

Scrapy rules do not work with 'restrict_xpaths'
