Unable to scrape pagination links using start URLs

Time: 2018-08-14 01:09:52

Tags: python web-scraping scrapy

I am trying to scrape a website that has pagination links, so I did this:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1,20)]

And it worked!! It works fine with a single URL pattern, but when I tried this:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s',
                  'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1,20)]

it doesn't work. How can I achieve the same logic for multiple URLs? Thanks.

1 Answer:

Answer 0 (score: 0)

One way of doing it is to use the start_requests() method of scrapy.Spider instead of the start_urls attribute. You can see more here:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['dummymart.com']

    def start_requests(self):
        for page in range(1,20):
            yield scrapy.Request(
                url='https://www.dummymart.net/product/auto-parts--118?page%s' % page,
                callback=self.parse,
            )
            yield scrapy.Request(
                url='https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page,
                callback=self.parse,
            )
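As a side note, the same start_requests() logic scales to more URL patterns if you keep the templates in a list and loop over them. Here is a minimal, untested sketch of that variant; the url_templates attribute is an illustrative name, not a Scrapy built-in, and the URLs are the ones from the question:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['dummymart.com']

    # Illustrative attribute (not a Scrapy built-in) holding the URL
    # templates from the question; '%s' is filled with the page number.
    url_templates = [
        'https://www.dummymart.net/product/auto-parts--118?page%s',
        'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s',
    ]

    def start_requests(self):
        for template in self.url_templates:
            for page in range(1, 20):
                # One request per (template, page) combination.
                yield scrapy.Request(url=template % page, callback=self.parse)

Either way, you can run the spider as usual, e.g. scrapy crawl dummymart.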

If you want to keep using the start_urls attribute, you can try something like this (I haven't tested it):

start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1,20)] + \
             ['https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1,20)]

Also note that in the allowed_domains attribute you only need to specify the domain. See here
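For instance, contrasting the spider from the question with the one above (a minimal illustration):

# A URL path is not a domain; Scrapy expects bare domain names here
allowed_domains = ['www.dummymart.com/product']

# Just the domain is enough and covers every page under it
allowed_domains = ['dummymart.com']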