Scraping a multi-level menu with Scrapy 1.5

Date: 2018-11-21 14:20:46

Tags: python scrapy scrapy-spider

I am trying to get all the links from a multi-level menu.

import scrapy

from foodisgood.items import FoodisgoodItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class BbcSpider(CrawlSpider):

    name = 'bbc'
    allowed_domains = ['bbcgoodfood.com']

    start_urls = ['https://www.bbcgoodfood.com/recipes/category/ingredients']

    rules = (
        # One rule per menu level: sub-category pages and collection pages.
        Rule(LinkExtractor(allow=r'/recipes/category/[\w-]+$', restrict_xpaths='//article[contains(@class, "cleargridindent")]'), callback='parse_sub_categories', follow=True),
        Rule(LinkExtractor(allow=r'/recipes/collection/[\w-]+$', restrict_xpaths='//article[contains(@class, "cleargridindent")]'), callback='parse_collections', follow=True),
    )

    def parse_sub_categories(self, response):
        l = ItemLoader(item=FoodisgoodItem(), response=response)

        l.default_output_processor = TakeFirst()

        l.add_xpath('category_title', '//h1[@class="section-head--title"]/text()')
        l.add_value('page_url', response.url)

        yield l.load_item()

    def parse_collections(self, response):
        l = ItemLoader(item=FoodisgoodItem(), response=response)

        l.default_output_processor = TakeFirst()

        l.add_xpath('collection_title', '//h1[@class="section-head--title"]/text()')
        l.add_value('page_url', response.url)

        yield l.load_item()

[Screenshot: results of menu scraping] But I don't understand how to fill the empty first column in front of the collection title.

Right now I get:

(empty) | Steak recipes | https://www.bbcgoodfood.com/recipes/collection/steak

But I need:

Meat | Steak recipes | https://www.bbcgoodfood.com/recipes/collection/steak

Can anyone suggest what I need to do to get the sub-category into the first column?

Thanks, everyone :)

1 Answer:

Answer 0 (score: 0)

CrawlSpider rules don't really let you do what you want here (at least not in a simple way).

The usual way to do this is documented in Passing additional data to callback functions:
you extract the category in the first callback, then create a new request for each collection page and pass that information along in the request's meta dict.
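A minimal sketch of that approach, reusing the spider from the question (it assumes FoodisgoodItem has a category_title field, as in the question; the collection_extractor attribute name is just illustrative): only the category level is driven by a rule, and the collection links are followed manually so each request carries the parent category title in its meta dict.

import scrapy

from foodisgood.items import FoodisgoodItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class BbcSpider(CrawlSpider):

    name = 'bbc'
    allowed_domains = ['bbcgoodfood.com']
    start_urls = ['https://www.bbcgoodfood.com/recipes/category/ingredients']

    # Only category pages are followed by a rule; collection links are
    # handled manually inside the callback so extra data can be attached.
    rules = (
        Rule(
            LinkExtractor(
                allow=r'/recipes/category/[\w-]+$',
                restrict_xpaths='//article[contains(@class, "cleargridindent")]',
            ),
            callback='parse_sub_categories',
            follow=True,
        ),
    )

    # Extractor for collection links found on a sub-category page.
    collection_extractor = LinkExtractor(
        allow=r'/recipes/collection/[\w-]+$',
        restrict_xpaths='//article[contains(@class, "cleargridindent")]',
    )

    def parse_sub_categories(self, response):
        # Grab the sub-category title once per category page ...
        category_title = response.xpath(
            '//h1[@class="section-head--title"]/text()').extract_first()

        # ... and hand it to every collection request via the meta dict.
        for link in self.collection_extractor.extract_links(response):
            yield scrapy.Request(
                link.url,
                callback=self.parse_collections,
                meta={'category_title': category_title},
            )

    def parse_collections(self, response):
        l = ItemLoader(item=FoodisgoodItem(), response=response)
        l.default_output_processor = TakeFirst()

        # The category comes from the parent page, not from this response.
        l.add_value('category_title', response.meta['category_title'])
        l.add_xpath('collection_title', '//h1[@class="section-head--title"]/text()')
        l.add_value('page_url', response.url)

        yield l.load_item()

With this, parse_collections reads the category from response.meta instead of trying to find it on the collection page itself, so the first column is filled with the parent sub-category title.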