How do I parse multiple subpages, merge/append the results, and pass them back up to the parent?

Posted: 2019-05-02 21:34:29

Tags: python web-scraping scrapy

This is my first little project, and admittedly my first exercise in Python. I'm looking for a way to scrape multiple subpages, merge/append their content into a single value, and pass that data BACK/UP to the original parent page. The number of subpages per parent is also variable: it can be as low as 1, but will never be 0 (which might matter for error handling?). Additionally, subpages can repeat and reappear, since they aren't unique to a single parent page. I've managed to pass parent-page metadata DOWN to the corresponding subpages, but I'm stuck doing the reverse.

Here is the example page structure:

Top Level Domain
     - Pagination/Index Page #1 (parse recipe links)
          - Recipe #1 (select info & parse ingredient links)
               - Ingredient #1 (select info)
               - Ingredient #2 (select info)
               - Ingredient #3 (select info)
          - Recipe #2
               - Ingredient #1
          - Recipe #3
               - Ingredient #1
               - Ingredient #2
     - Pagination/Index Page #2
          - Recipe #N
               - Ingredient #N
               - ...
     - Pagination/Index Page #3
     - ... continued

The output I'm looking for (per recipe) looks like this:

{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": "135 calories",
"recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

I'm extracting each ingredient's URL from its respective recipe page. I need to pull the calorie count from each ingredient page, merge it with the calorie counts of the other ingredients, and ideally yield a single item per recipe. Since an ingredient isn't unique to a single recipe, I also need to be able to revisit ingredient pages later in the crawl.

(Note: this isn't a real example, since calorie counts would obviously vary with the quantities a recipe calls for.)

The code below gets me close to what I'm after, but I have to imagine there's a more elegant way to approach the problem. It successfully passes a recipe's metadata down to the ingredient level, iterates over the ingredients, and appends the calorie counts. Because the information is passed along, I end up yielding at the ingredient level, producing many duplicates of each recipe (one per ingredient) until the last ingredient has been processed. My next idea was to add an ingredient index number so that I could somehow keep only the record with the highest ingredient index for each recipe URL. Before going down that path, I figured I'd ask the professionals for guidance.

Spider code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from recipe_scraper.items import RecipeItem

class RecipeSpider(CrawlSpider):
    name = 'Recipe'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/recipes/']
    rules = (
        Rule(
            LinkExtractor(
                allow=()
                ,restrict_css=('.pagination')
                ,unique=True
            )
            ,callback='parse_index_page'
            ,follow=True
        ),
    )

    def parse_index_page(self, response):
        print('Processing Index Page.. ' + response.url)
        index_url = response.url
        recipe_urls = response.css('.recipe > a::attr(href)').getall()
        for a in recipe_urls:
            request = scrapy.Request(a, callback=self.parse_recipe_page)
            request.meta['index_url'] = index_url
            yield request

    def parse_recipe_page(self, response):
        print('Processing Recipe Page.. ' + response.url)
        Recipe_url = response.url
        Recipe_title = response.css('.Recipe_title::text').extract()[0]
        Recipe_posted_date = response.css('.Recipe_posted_date::text').extract()[0]
        Recipe_instructions = response.css('.Recipe_instructions::text').extract()[0]
        Recipe_ingredients = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/text()').getall()
        Recipe_ingredient_urls = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/@href').getall()
        Recipe_calorie_list_append = []
        Recipe_calorie_list = []
        Recipe_calorie_total = []
        Recipe_item = RecipeItem()
        Recipe_item['index_url'] = response.meta["index_url"]
        Recipe_item['Recipe_url'] = Recipe_url
        Recipe_item['Recipe_title'] = Recipe_title
        Recipe_item['Recipe_posted_date'] = Recipe_posted_date
        Recipe_item['Recipe_instructions'] = Recipe_instructions
        Recipe_item['Recipe_ingredients'] = Recipe_ingredients
        Recipe_item['Recipe_ingredient_urls'] = Recipe_ingredient_urls
        Recipe_item['Recipe_ingredient_url_count'] = len(Recipe_ingredient_urls)
        Recipe_calorie_list.clear()
        Recipe_ingredient_url_index = 0
        while Recipe_ingredient_url_index < len(Recipe_ingredient_urls):
            ingredient_request = scrapy.Request(Recipe_ingredient_urls[Recipe_ingredient_url_index], callback=self.parse_ingredient_page, dont_filter=True)
            ingredient_request.meta['Recipe_item'] = Recipe_item
            ingredient_request.meta['Recipe_calorie_list'] = Recipe_calorie_list
            yield ingredient_request
            Recipe_calorie_list_append.append(Recipe_calorie_list)
            Recipe_ingredient_url_index += 1

    def parse_ingredient_page(self, response):
        print('Processing Ingredient Page.. ' + response.url)
        Recipe_item = response.meta['Recipe_item']
        Recipe_calorie_list = response.meta["Recipe_calorie_list"]
        ingredient_url = response.url
        ingredient_calorie_total = response.css('div.calorie::text').getall()
        Recipe_calorie_list.append(ingredient_calorie_total)
        Recipe_item['Recipe_calorie_list'] = Recipe_calorie_list
        yield Recipe_item
        Recipe_calorie_list.clear()
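
For reference, the RecipeItem class imported above isn't shown; a minimal sketch of recipe_scraper/items.py, inferred purely from the fields the spider assigns (field names from the code, nothing else), would be:

import scrapy

# Sketch inferred from the assignments in parse_recipe_page /
# parse_ingredient_page above; not the poster's actual file.
class RecipeItem(scrapy.Item):
    index_url = scrapy.Field()
    Recipe_url = scrapy.Field()
    Recipe_title = scrapy.Field()
    Recipe_posted_date = scrapy.Field()
    Recipe_instructions = scrapy.Field()
    Recipe_ingredients = scrapy.Field()
    Recipe_ingredient_urls = scrapy.Field()
    Recipe_ingredient_url_count = scrapy.Field()
    Recipe_calorie_list = scrapy.Field()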

As-is, my output is less than ideal and looks like this (note the calorie lists):

{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories"]
},
{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories", "70 calories"]
},
{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

1 answer:

Answer #0 (score: 0)

One solution is to scrape recipes and ingredients separately, as different items, and do some postprocessing once the crawl finishes, e.g. with regular Python, merging the recipe and ingredient data as needed. This is the most efficient solution.
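
For instance, a minimal merging sketch along these lines, assuming the crawl wrote the two item types to two JSON Lines feeds, and using hypothetical file and field names:

import json

# Hypothetical feed files and field names; adjust to whatever your
# item pipelines / feed exports actually produce.
with open('ingredients.jl') as f:
    # Map each ingredient URL to its scraped calorie count.
    calories_by_url = {
        item['ingredient_url']: item['calories']
        for item in map(json.loads, f)
    }

with open('recipes.jl') as f:
    for recipe in map(json.loads, f):
        # Look up each ingredient's calories by URL and merge them in.
        recipe['recipe_calorie_list'] = [
            calories_by_url[url] for url in recipe['recipe_ingredient_urls']
        ]
        print(json.dumps(recipe))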

Alternatively, you can extract the ingredient URLs from the recipe response and, instead of yielding requests for all of the ingredients at once, yield a request for the first ingredient only, saving the remaining ingredient URLs in the new request's meta, along with the recipe item. When you receive an ingredient response, you parse the needed information into the recipe and yield a new request for the next ingredient URL from meta. When there are no ingredient URLs left, you yield the completed recipe item.

For example:

# (Assumes `from scrapy import Request` at the top of the spider module.)
def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
        )
    except IndexError:
        return recipe

def parse_recipe(self, response):
    recipe = {}
    ingredient_urls = []
    # [Extract needed data into recipe and ingredient URLs into ingredient_urls]
    yield self._handle_next_ingredient(recipe, ingredient_urls)

def parse_ingredient(self, response):
    recipe = response.meta['recipe']
    # [Extend recipe with the information of this ingredient]
    yield self._handle_next_ingredient(recipe, response.meta['ingredient_urls'])

Note, however, that if two or more recipes can share the same ingredient URL, you will have to add dont_filter=True to the requests, repeating multiple requests for the same ingredient. If ingredient URLs are not recipe-specific, give serious consideration to the first suggestion.
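
In other words, the helper above would become:

def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
            dont_filter=True,  # re-request ingredient URLs already visited for other recipes
        )
    except IndexError:
        return recipe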