This is my first little project and, admittedly, my first exercise in Python. I am looking for a way to scrape multiple child pages, combine/append their content into a single value, and pass that data back up to the original parent page. The number of child pages per parent is also variable; it can be as few as one, but will never be zero (which may help with error handling?). Additionally, child pages can recur, since they are not unique to a single parent page. I have managed to pass parent-page metadata down to the corresponding child pages, but I am stuck on doing the reverse.
Here is the example page structure:
Top Level Domain
- Pagination/Index Page #1 (parse recipe links)
  - Recipe #1 (select info & parse ingredient links)
    - Ingredient #1 (select info)
    - Ingredient #2 (select info)
    - Ingredient #3 (select info)
  - Recipe #2
    - Ingredient #1
  - Recipe #3
    - Ingredient #1
    - Ingredient #2
- Pagination/Index Page #2
  - Recipe #N
    - Ingredient #N
  - ...
- Pagination/Index Page #3
- ... continued
The output I am looking for (per recipe) looks like this:
{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": "135 calories",
    "recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}
I am pulling each ingredient's URL from its respective recipe page. I need to pull the calorie count from each ingredient page, combine it with the calorie counts of the other ingredients, and ideally produce a single item. Since an ingredient is not unique to one recipe, I also need to be able to revisit ingredient pages later in the crawl.
(Note: this is not a real example, since the calorie count would obviously vary with the amount the recipe calls for.)
The code I have posted gets me close to what I am looking for, but I have to imagine there is a more elegant solution. It successfully passes a recipe's metadata down to the ingredient level, iterates over the ingredients, and appends the calorie counts. Because the information is passed down, I end up yielding at the ingredient level, which creates many duplicates of each recipe (one per ingredient) until the last ingredient has been processed. At this point I was considering adding an ingredient index number so that I could somehow keep only the record with the highest ingredient index per recipe URL, roughly as sketched below. Before going down that road, I figured I would ask the pros for guidance.
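For illustration only, the post-crawl deduplication I had in mind would look roughly like this; it assumes a JSON-lines feed export, and the file name is made up:

import json

# Keep only the most complete item (longest calorie list) per recipe URL.
best = {}
with open('output.jl') as f:  # hypothetical feed export
    for item in map(json.loads, f):
        url = item['Recipe_url']
        if url not in best or len(item['Recipe_calorie_list']) > len(best[url]['Recipe_calorie_list']):
            best[url] = item

deduped = list(best.values())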
The scraping code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from recipe_scraper.items import RecipeItem


class RecipeSpider(CrawlSpider):
    name = 'Recipe'
    allowed_domains = ['www.example.com']
    start_urls = ['https://www.example.com/recipes/']

    rules = (
        Rule(
            LinkExtractor(
                allow=(),
                restrict_css=('.pagination'),
                unique=True,
            ),
            callback='parse_index_page',
            follow=True,
        ),
    )

    def parse_index_page(self, response):
        print('Processing Index Page.. ' + response.url)
        index_url = response.url
        recipe_urls = response.css('.recipe > a::attr(href)').getall()
        for a in recipe_urls:
            request = scrapy.Request(a, callback=self.parse_recipe_page)
            request.meta['index_url'] = index_url
            yield request

    def parse_recipe_page(self, response):
        print('Processing Recipe Page.. ' + response.url)
        Recipe_url = response.url
        Recipe_title = response.css('.Recipe_title::text').extract()[0]
        Recipe_posted_date = response.css('.Recipe_posted_date::text').extract()[0]
        Recipe_instructions = response.css('.Recipe_instructions::text').extract()[0]
        Recipe_ingredients = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/text()').getall()
        Recipe_ingredient_urls = response.xpath('//ul[@class="ingredients"]//li[@class="ingredient"]/a/@href').getall()
        Recipe_calorie_list_append = []
        Recipe_calorie_list = []
        Recipe_calorie_total = []

        Recipe_item = RecipeItem()
        Recipe_item['index_url'] = response.meta["index_url"]
        Recipe_item['Recipe_url'] = Recipe_url
        Recipe_item['Recipe_title'] = Recipe_title
        Recipe_item['Recipe_posted_date'] = Recipe_posted_date
        Recipe_item['Recipe_instructions'] = Recipe_instructions
        Recipe_item['Recipe_ingredients'] = Recipe_ingredients
        Recipe_item['Recipe_ingredient_urls'] = Recipe_ingredient_urls
        Recipe_item['Recipe_ingredient_url_count'] = len(Recipe_ingredient_urls)
        Recipe_calorie_list.clear()

        Recipe_ingredient_url_index = 0
        while Recipe_ingredient_url_index < len(Recipe_ingredient_urls):
            ingredient_request = scrapy.Request(Recipe_ingredient_urls[Recipe_ingredient_url_index], callback=self.parse_ingredient_page, dont_filter=True)
            ingredient_request.meta['Recipe_item'] = Recipe_item
            ingredient_request.meta['Recipe_calorie_list'] = Recipe_calorie_list
            yield ingredient_request
            Recipe_calorie_list_append.append(Recipe_calorie_list)
            Recipe_ingredient_url_index += 1

    def parse_ingredient_page(self, response):
        print('Processing Ingredient Page.. ' + response.url)
        Recipe_item = response.meta['Recipe_item']
        Recipe_calorie_list = response.meta["Recipe_calorie_list"]
        ingredient_url = response.url
        ingredient_calorie_total = response.css('div.calorie::text').getall()
        Recipe_calorie_list.append(ingredient_calorie_total)
        Recipe_item['Recipe_calorie_list'] = Recipe_calorie_list
        yield Recipe_item
        Recipe_calorie_list.clear()
As is, my output is less than ideal and looks like the following (note the calorie lists):
{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": [],
    "recipe_calorie_list": ["60 calories"]
},
{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": [],
    "recipe_calorie_list": ["60 calories", "70 calories"]
},
{
    "recipe_title": "Gin & Tonic",
    "recipe_posted_date": "May 2, 2019",
    "recipe_url": "www.XYZ.com/gandt.html",
    "recipe_instructions": "<block of text here>",
    "recipe_ingredients": ["gin", "tonic water", "lime wedge"],
    "recipe_calorie_total": [],
    "recipe_calorie_list": ["60 calories", "70 calories", "5 calories"]
}
Answer 0 (score: 0)
One solution is to scrape recipes and ingredients as separate items and do some post-processing after the crawl finishes, for example with regular Python, merging the recipe and ingredient data as needed. This is the most efficient solution.
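For instance, a minimal sketch of that post-processing step, assuming both crawls were exported as JSON-lines feeds; the file names and field names here are hypothetical:

import json

# Index ingredient items by URL so each shared ingredient is looked up once,
# however many recipes reference it. (Field names hypothetical.)
with open('ingredients.jl') as f:
    calories_by_url = {
        item['ingredient_url']: item['ingredient_calories']
        for item in map(json.loads, f)
    }

# Attach each recipe's ingredient calorie counts and write the merged items.
with open('recipes.jl') as f, open('recipes_merged.jl', 'w') as out:
    for line in f:
        recipe = json.loads(line)
        recipe['recipe_calorie_list'] = [
            calories_by_url[url] for url in recipe['recipe_ingredient_urls']
        ]
        out.write(json.dumps(recipe) + '\n')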
Alternatively, you can extract the ingredient URLs from the recipe response and, instead of yielding requests for all the ingredients at once, yield a request for the first ingredient only, saving the remaining ingredient URLs in the new request's meta along with the recipe item. When the ingredient response arrives, you parse the needed information into the recipe item and yield a new request for the next ingredient URL. Once there are no ingredient URLs left, you yield the completed recipe item instead.
For example:
from scrapy import Request

def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        # Pop one pending URL; pass the recipe and the remaining URLs along.
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
        )
    except IndexError:
        # No pending ingredient URLs left: the recipe item is complete.
        return recipe

def parse_recipe(self, response):
    recipe, ingredient_urls = {}, []
    # [Extract needed data into recipe and ingredient URLs into ingredient_urls]
    yield self._handle_next_ingredient(recipe, ingredient_urls)

def parse_ingredient(self, response):
    recipe = response.meta['recipe']
    # [Extend recipe with the information of this ingredient]
    yield self._handle_next_ingredient(recipe, response.meta['ingredient_urls'])
However, note that if two or more recipes can share the same ingredient URL, you will have to add dont_filter=True to the requests, repeating the request for the same ingredient multiple times. If ingredient URLs are not recipe-specific, seriously consider the first suggestion.
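Concretely, that just means adding the flag to the Request in _handle_next_ingredient above, e.g.:

def _handle_next_ingredient(self, recipe, ingredient_urls):
    try:
        return Request(
            ingredient_urls.pop(),
            callback=self.parse_ingredient,
            meta={'recipe': recipe, 'ingredient_urls': ingredient_urls},
            dont_filter=True,  # allow repeated requests for ingredients shared by several recipes
        )
    except IndexError:
        return recipe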