How do I parse multiple subpages, merge/append the results, and pass them back up to the parent?

Posted: 2019-05-02 21:34:29

Tags: python web-scraping scrapy

This is my first little project, and admittedly my first exercise in Python. I'm looking for a way to scrape multiple subpages, merge/append their content into a single value, and pass that data BACK/UP to the original parent page. The number of subpages per parent is also variable: it can be as low as 1, but will never be 0 (which might matter for error handling?). Additionally, subpages can repeat and reappear, since they aren't unique to a single parent page. I've managed to pass parent-page metadata DOWN to the corresponding subpages, but I'm stuck doing the reverse.

Here is the example page structure:

Top Level Domain
     - Pagination/Index Page #1 (parse recipe links)
          - Recipe #1 (select info & parse ingredient links)
               - Ingredient #1 (select info)
               - Ingredient #2 (select info)
               - Ingredient #3 (select info)
          - Recipe #2
               - Ingredient #1
          - Recipe #3
               - Ingredient #1
               - Ingredient #2
     - Pagination/Index Page #2
          - Recipe #N
               - Ingredient #N
               - ...
     - Pagination/Index Page #3
     - ... continued

The output I'm looking for (per recipe) looks like this:

{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": "135 calories",
"recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

I'm extracting each ingredient's URL from its respective recipe page. I need to pull the calorie count from each ingredient page, merge it with the calorie counts of the other ingredients, and ideally yield a single item per recipe. Since an ingredient isn't unique to a single recipe, I also need to be able to revisit ingredient pages later in the crawl.

(Note: this isn't a real example, since calorie counts would obviously vary with the quantities a recipe calls for.)

The code below gets me close to what I'm after, but I have to imagine there's a more elegant way to approach the problem. It successfully passes a recipe's metadata down to the ingredient level, iterates over the ingredients, and appends the calorie counts. Because the information is passed along, I end up yielding at the ingredient level, producing many duplicates of each recipe (one per ingredient) until the last ingredient has been processed. My next idea was to add an ingredient index number so that I could somehow keep only the record with the highest ingredient index for each recipe URL. Before going down that path, I figured I'd ask the professionals for guidance.

Spider code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from recipe_scraper.items import RecipeItem

class RecipeSpider(CrawlSpider):
    name = 'Recipe'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/recipes/']
    rules = (
        Rule(
            LinkExtractor(
                allow=()
                ,restrict_css=('.pagination')
                ,unique=True
            )
            ,callback='parse_index_page'
            ,follow=True
        ),
    )

    def parse_index_page(self, response):
        print('Processing Index Page.. ' + response.url)
        index_url = response.url
        recipe_urls = response.css('.recipe > a::attr(href)').getall()
        for a in recipe_urls:
            request = scrapy.Request(a, callback=self.parse_recipe_page)
            request.meta['index_url'] = index_url
            yield request

    def parse_recipe_page(self, response):
        print('Processing Recipe Page.. ' + response.url)
        Recipe_url = response.url
        Recipe_title = response.css('.Recipe_title::text').extract()[0]
        Recipe_posted_date = response.css('.Recipe_posted_date::text').extract()[0]
        Recipe_instructions = response.css('.Recipe_instructions::text').extract()[0]
        Recipe_ingredients = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/text()').getall()
        Recipe_ingredient_urls = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/@href').getall()
        Recipe_calorie_list_append = []
        Recipe_calorie_list = []
        Recipe_calorie_total = []
        Recipe_item = RecipeItem()
        Recipe_item['index_url'] = response.meta["index_url"]
        Recipe_item['Recipe_url'] = Recipe_url
        Recipe_item['Recipe_title'] = Recipe_title
        Recipe_item['Recipe_posted_date'] = Recipe_posted_date
        Recipe_item['Recipe_instructions'] = Recipe_instructions
        Recipe_item['Recipe_ingredients'] = Recipe_ingredients
        Recipe_item['Recipe_ingredient_urls'] = Recipe_ingredient_urls
        Recipe_item['Recipe_ingredient_url_count'] = len(Recipe_ingredient_urls)
        Recipe_calorie_list.clear()
        Recipe_ingredient_url_index = 0
        while Recipe_ingredient_url_index < len(Recipe_ingredient_urls):
            ingredient_request = scrapy.Request(Recipe_ingredient_urls[Recipe_ingredient_url_index], callback=self.parse_ingredient_page, dont_filter=True)
            ingredient_request.meta['Recipe_item'] = Recipe_item
            ingredient_request.meta['Recipe_calorie_list'] = Recipe_calorie_list
            yield ingredient_request
            Recipe_calorie_list_append.append(Recipe_calorie_list)
            Recipe_ingredient_url_index += 1

    def parse_ingredient_page(self, response):
        print('Processing Ingredient Page.. ' + response.url)
        Recipe_item = response.meta['Recipe_item']
        Recipe_calorie_list = response.meta["Recipe_calorie_list"]
        ingredient_url = response.url
        ingredient_calorie_total = response.css('div.calorie::text').getall()
        Recipe_calorie_list.append(ingredient_calorie_total)
        Recipe_item['Recipe_calorie_list'] = Recipe_calorie_list
        yield Recipe_item
        Recipe_calorie_list.clear()
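
For reference, the RecipeItem class imported above isn't shown; a minimal sketch of recipe_scraper/items.py, inferred purely from the fields the spider assigns (field names from the code, nothing else), would be:

import scrapy

# Sketch inferred from the assignments in parse_recipe_page /
# parse_ingredient_page above; not the poster's actual file.
class RecipeItem(scrapy.Item):
    index_url = scrapy.Field()
    Recipe_url = scrapy.Field()
    Recipe_title = scrapy.Field()
    Recipe_posted_date = scrapy.Field()
    Recipe_instructions = scrapy.Field()
    Recipe_ingredients = scrapy.Field()
    Recipe_ingredient_urls = scrapy.Field()
    Recipe_ingredient_url_count = scrapy.Field()
    Recipe_calorie_list = scrapy.Field()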

As-is, my output is less than ideal and looks like this (note the calorie lists):

{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories"]
},
{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories", "70 calories"]
},
{
"recipe_title": "Gin & Tonic",
"recipe_posted_date": "May 2, 2019",
"recipe_url": "www.XYZ.com/gandt.html",
"recipe_instructions": "<block of text here>",
"recipe_ingredients": ["gin", "tonic water", "lime wedge"],
"recipe_calorie_total": [],
"recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}

1 answer:

Answer #0 (score: 0)

One solution is to scrape recipes and ingredients separately, as different items, and do some postprocessing once the crawl finishes, e.g. with regular Python, merging the recipe and ingredient data as needed. This is the most efficient solution.
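
For instance, a minimal merging sketch along these lines, assuming the crawl wrote the two item types to two JSON Lines feeds, and using hypothetical file and field names:

import json

# Hypothetical feed files and field names; adjust to whatever your
# item pipelines / feed exports actually produce.
with open('ingredients.jl') as f:
    # Map each ingredient URL to its scraped calorie count.
    calories_by_url = {
        item['ingredient_url']: item['calories']
        for item in map(json.loads, f)
    }

with open('recipes.jl') as f:
    for recipe in map(json.loads, f):
        # Look up each ingredient's calories by URL and merge them in.
        recipe['recipe_calorie_list'] = [
            calories_by_url[url] for url in recipe['recipe_ingredient_urls']
        ]
        print(json.dumps(recipe))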

Alternatively, you can extract the ingredient URLs from the recipe response and, instead of yielding requests for all of the ingredients at once, yield a request for the first ingredient only, saving the remaining ingredient URLs in the new request's meta, along with the recipe item. When you receive an ingredient response, you parse the needed information into the recipe and yield a new request for the next ingredient URL from meta. When there are no ingredient URLs left, you yield the completed recipe item.

For example:

# (Assumes `from scrapy import Request` at the top of the spider module.)
def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
        )
    except IndexError:
        return recipe

def parse_recipe(self, response):
    recipe = {}
    ingredient_urls = []
    # [Extract needed data into recipe and ingredient URLs into ingredient_urls]
    yield self._handle_next_ingredient(recipe, ingredient_urls)

def parse_ingredient(self, response):
    recipe = response.meta['recipe']
    # [Extend recipe with the information of this ingredient]
    yield self._handle_next_ingredient(recipe, response.meta['ingredient_urls'])

Note, however, that if two or more recipes can share the same ingredient URL, you will have to add dont_filter=True to the requests, repeating multiple requests for the same ingredient. If ingredient URLs are not recipe-specific, give serious consideration to the first suggestion.
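
In other words, the helper above would become:

def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
            dont_filter=True,  # re-request ingredient URLs already visited for other recipes
        )
    except IndexError:
        return recipe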