Is it possible to do the following, but with multiple URLs, as shown below? Each link will have roughly 50 pages to crawl through in a loop. The current solution works, but only when I use a single URL rather than several.
start_urls = [
'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397' % page for page in range(1, 50),
'https://www.xxxxxxx.com.au/automotive/page-%s/c21159' % page for page in range(1, 50),
'https://www.xxxxxxx.com.au/garden/page-%s/c25449' % page for page in range(1, 50),
]
Answer 0 (score: 0)
This can be done by building a second list. I've shared the code below; hopefully this is what you're after.
final_urls = []
start_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]
# Page-major order: all three categories for page 1, then page 2, and so on
final_urls.extend(url % page for page in range(1, 50) for url in start_urls)
Output snippet (note the slice starts at index 1, so the home-garden page-1 URL is not shown):
final_urls[1:20]
['https://www.xxxxxxx.com.au/automotive/page-1/c21159',
'https://www.xxxxxxx.com.au/garden/page-1/c25449',
'https://www.xxxxxxx.com.au/home-garden/page-2/c18397',
'https://www.xxxxxxx.com.au/automotive/page-2/c21159',
'https://www.xxxxxxx.com.au/garden/page-2/c25449',
'https://www.xxxxxxx.com.au/home-garden/page-3/c18397',
'https://www.xxxxxxx.com.au/automotive/page-3/c21159',
'https://www.xxxxxxx.com.au/garden/page-3/c25449',
'https://www.xxxxxxx.com.au/home-garden/page-4/c18397',
'https://www.xxxxxxx.com.au/automotive/page-4/c21159',
'https://www.xxxxxxx.com.au/garden/page-4/c25449',
'https://www.xxxxxxx.com.au/home-garden/page-5/c18397',
'https://www.xxxxxxx.com.au/automotive/page-5/c21159',
'https://www.xxxxxxx.com.au/garden/page-5/c25449',
'https://www.xxxxxxx.com.au/home-garden/page-6/c18397',
'https://www.xxxxxxx.com.au/automotive/page-6/c21159',
'https://www.xxxxxxx.com.au/garden/page-6/c25449',
'https://www.xxxxxxx.com.au/home-garden/page-7/c18397',
'https://www.xxxxxxx.com.au/automotive/page-7/c21159']
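For what it's worth, the same page/template cross-product can also be written with itertools.product, which makes the iteration order explicit. A minimal standalone sketch, using the placeholder domain from the question:

```python
from itertools import product

start_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-%s/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-%s/c21159',
    'https://www.xxxxxxx.com.au/garden/page-%s/c25449',
]

# product(pages, templates) reproduces the same ordering as the nested
# generator expression above: all three templates for page 1, then all
# three for page 2, and so on.
final_urls = [url % page for page, url in product(range(1, 50), start_urls)]

print(len(final_urls))  # 49 pages x 3 templates = 147 URLs
print(final_urls[0])    # https://www.xxxxxxx.com.au/home-garden/page-1/c18397
```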
As for your follow-up question, have you tried this?
def parse(self, response):
    for link in final_urls:
        yield scrapy.Request(link)
Answer 1 (score: 0)
I'd suggest using start_requests for this:
def start_requests(self):
    base_urls = [
        'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
        'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
        'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
    ]
    # range(1, 50) covers pages 1-49; use range(1, 51) if you need page 50 too
    for page in range(1, 50):
        for base_url in base_urls:
            url = base_url.format(page_number=page)
            yield scrapy.Request(url, callback=self.parse)
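The URL-generation logic inside start_requests can be checked on its own, without running Scrapy. A minimal sketch that factors it into a plain function (generate_urls is a hypothetical helper name, not part of the spider above):

```python
def generate_urls(base_urls, last_page=49):
    """Expand each template for pages 1..last_page, page-major order."""
    for page in range(1, last_page + 1):
        for base_url in base_urls:
            yield base_url.format(page_number=page)

base_urls = [
    'https://www.xxxxxxx.com.au/home-garden/page-{page_number}/c18397',
    'https://www.xxxxxxx.com.au/automotive/page-{page_number}/c21159',
    'https://www.xxxxxxx.com.au/garden/page-{page_number}/c25449',
]

urls = list(generate_urls(base_urls))
print(len(urls))  # 49 pages x 3 templates = 147
```

In the spider itself you would yield scrapy.Request(url, callback=self.parse) for each generated URL, exactly as in the answer above.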