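I'm new to Python and Scrapy. I've watched some Udemy and YouTube tutorials and am now trying my first example. I know how to loop over pages when there is a "next" button, but in my case there isn't one.
Here is my code. It works for one of the URLs, but the start URL will need to change later: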
class Heroes1JobSpider(scrapy.Spider):
    name = 'heroes1_job'
    # where to extract
    allowed_domains = ['icy-veins.com']
    start_urls = ['https://www.icy-veins.com/heroes/alarak-build-guide']

    def parse(self, response):
        # what to extract
        hero_names = response.xpath('//span[@class="page_breadcrumbs_item"]/text()').extract()
        hero_buildss = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
        hero_buildskillss = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()
        for item in zip(hero_names, hero_buildss, hero_buildskillss):
            new_item = Heroes1Item()
            new_item['hero_name'] = item[0]
            new_item['hero_builds'] = item[1]
            new_item['hero_buildskills'] = item[2]
            yield new_item
But this is only one hero, and I want all 90 of them. Each URL depends on the hero's name. I can get the list of URLs with the following:
start_urls = ['https://www.icy-veins.com/heroes/assassin-hero-guides']
...
response.xpath('//div[@class="nav_content_block_entry_heroes_hero"]/a/@href').extract()
But I don't know how to store this list so that the parse function iterates over it.
Thanks in advance!
Answer 0: (score: 1)
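Is it essential to parse them in the parse function? You can parse the list of heroes in one callback and then iterate over that list, sending each hero page to a second callback that scrapes the hero data, like this: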
from scrapy import Request
...
    start_urls = ['https://www.icy-veins.com/heroes/assassin-hero-guides']

    def parse(self, response):
        # collect the link to every hero guide on the listing page
        heroes_xpath = '//div[@class="nav_content_block_entry_heroes_hero"]/a/@href'
        for link in response.xpath(heroes_xpath).extract():
            # schedule each hero page; parse_hero handles its response
            yield Request(response.urljoin(link), self.parse_hero)

    def parse_hero(self, response):
        # copying your method here
        hero_names = response.xpath('//span[@class="page_breadcrumbs_item"]/text()').extract()
        hero_buildss = response.xpath('//h3[@class="toc_no_parsing"]/text()').extract()
        hero_buildskillss = response.xpath('//span[@class="heroes_build_talent_tier_visual"]').extract()
        for item in zip(hero_names, hero_buildss, hero_buildskillss):
            new_item = Heroes1Item()
            new_item['hero_name'] = item[0]
            new_item['hero_builds'] = item[1]
            new_item['hero_buildskills'] = item[2]
            yield new_item
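
A side note on response.urljoin(link) in the code above: it is there because the extracted href values may be relative. As far as I know, it resolves them against the URL of the page being parsed, much like urllib.parse.urljoin from the standard library (for HTML responses Scrapy may also take a <base> tag into account). Here is a minimal sketch of that resolution step using only the standard library. The href values and the absolute_links helper are made up for illustration; I have not checked which form the real page uses.

```python
from urllib.parse import urljoin

# Listing page, taken from start_urls in the answer above.
LISTING_URL = 'https://www.icy-veins.com/heroes/assassin-hero-guides'


def absolute_links(base_url, hrefs):
    """Resolve every href against base_url (hypothetical helper, not part of Scrapy)."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical href values; the real page may use any of these forms.
hrefs = [
    '/heroes/alarak-build-guide',                           # root-relative
    '//www.icy-veins.com/heroes/alarak-build-guide',        # protocol-relative
    'https://www.icy-veins.com/heroes/alarak-build-guide',  # already absolute
]

for url in absolute_links(LISTING_URL, hrefs):
    print(url)
```

All three forms resolve to the same absolute URL, so joining is safe whichever form the site uses.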