Question

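The basic spider works. I then converted it to a CrawlSpider with rules, and unfortunately the spider no longer works.

The basic spider was tested on a product detail page: https://www.ah.nl/producten/product/wi395939/ah-kleintje-boerenkool and it returns the specified items there.

What I want is to use a CrawlSpider to go through all bonus articles at https://www.ah.nl/bonus, follow each one to its product detail page, and scrape the specified information.

How can I fix my code so that the spider works again?
Can someone explain what I am doing wrong with the rules?
I also want to exclude response.xpath("//div[contains(@class,'product-sidebar__products')]") when this "anderen kochten ook" block (in English: "others also bought") appears on the product detail page. It is present here: https://www.ah.nl/producten/product/wi160917/ah-verse-pesto-groen and absent here: https://www.ah.nl/producten/product/wi220252/swiffer-vloerreiniger-navul-stofdoekjes

I have tried a lot of things, but I cannot get my head around the rules.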

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# AhItem is the item class defined in the project's own items module.


class ahSpider(CrawlSpider):
    name = 'ah'
    allowed_domains = ['ah.nl']  # only the domain name here, not a full URL
    start_urls = ['https://www.ah.nl']

    # "anderen kochten ook" (in English: "others also bought")
    # response.xpath("//div[contains(@class,'product-sidebar__products')]")

    rules = [
        Rule(LinkExtractor(allow=('/bonus'), deny=('/allerhandebox/', '/allerhande/', '/winkels/', '/acties/', '/klantenservice/', '/zakelijk/', '/bezorgbundel/', '/vakslager/')), follow=True),
        Rule(LinkExtractor(allow=('/producten/product/[0-9]+/[0-9]+'),), callback='parse_items'),
    ]

    #def parse(self, response):
    def parse_items(self, response):
        items = AhItem()

        product_name = response.xpath("//span[contains(@class, 'line-clamp--active')]//text()").extract_first()

        items['product_name'] = product_name
        yield items

Answer 1

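The main problem seems to come from the expression '[0-9]+/[0-9]+'. The product detail links on the page look like "https://www.ah.nl/producten/product/wi460830/edet-ultra-soft-tp-magnolia-4-laags" and "https://www.ah.nl/producten/product/wi210145/heineken-premium-pilsener". The path segment after /producten/product/ starts with the letters "wi" and the last segment is a text slug, so a pattern that requires two purely numeric segments never matches and every product link is filtered out. If you change the expression to allow=('/producten/product/'), these product detail links are no longer filtered out.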
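The mismatch can be checked outside Scrapy with the standard `re` module. This is only a sketch: it assumes the link extractor tests each `allow` pattern with a regex search against the full URL, and the `tighter` pattern and the `matching` helper are hypothetical additions for illustration, not part of the original code.

```python
import re

# The two product detail URLs quoted in the answer.
product_urls = [
    "https://www.ah.nl/producten/product/wi460830/edet-ultra-soft-tp-magnolia-4-laags",
    "https://www.ah.nl/producten/product/wi210145/heineken-premium-pilsener",
]

# Pattern from the question: expects two purely numeric path segments.
original = re.compile(r"/producten/product/[0-9]+/[0-9]+")
# Pattern suggested in the answer.
relaxed = re.compile(r"/producten/product/")
# Hypothetical stricter variant that still accepts the "wi" prefix.
tighter = re.compile(r"/producten/product/wi[0-9]+/")


def matching(pattern, urls):
    """Return the URLs in which the pattern is found anywhere."""
    return [url for url in urls if pattern.search(url)]


print(len(matching(original, product_urls)))
print(len(matching(relaxed, product_urls)))
print(len(matching(tighter, product_urls)))
```

The original pattern finds no match in either URL, which would explain an empty output file with no errors: the second rule never fires, so parse_items is never called.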
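The mistake in the rules is explained above.
To skip pages that show the "others also bought" block, you can include the following in the parse_items method: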

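# Put this import at the top of the spider module.
from scrapy.exceptions import DropItem

# Inside parse_items, before the item is yielded:
others = response.xpath('//div[contains(@class,"product-sidebar__products")]')
if others:
    raise DropItem("'others also bought' present on the product_detail page")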
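An alternative, sketched below, is to return early from the callback instead of raising an exception; as far as I know, DropItem is documented mainly for item pipelines. The sketch makes several assumptions: the callback is written as a plain function so it can be run without Scrapy (in the spider it would be a method taking `self`), it yields a plain dict rather than the project's AhItem, and the two constant names are hypothetical.

```python
# XPath expressions taken from the question.
SIDEBAR_XPATH = "//div[contains(@class,'product-sidebar__products')]"
NAME_XPATH = "//span[contains(@class, 'line-clamp--active')]//text()"


def parse_items(response):
    """Yield one item per product page, or nothing if the sidebar exists."""
    # An empty selector list is falsy, so this is true only when the
    # "anderen kochten ook" sidebar is present on the page.
    if response.xpath(SIDEBAR_XPATH):
        return  # skip this product page without raising

    # A plain dict stands in for AhItem here (assumption).
    yield {"product_name": response.xpath(NAME_XPATH).extract_first()}
```

Because the function only relies on `response.xpath(...)` and `extract_first()`, it can be tried against a small stand-in object before wiring it into the spider.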

Answer 2

Thanks for your response. I do not get any errors, just an empty file. I hope you can give me some feedback on the code?

Thanks! Rob

Scrapy crawl spider rules
