I'm trying to scrape a website with paginated links, so I did this:
import scrapy

class SummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummrmart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1,20)]
It worked! It's fine with a single URL, but when I tried to do this:
import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s',
                  'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1,20)]
it doesn't work. How can I achieve the same logic for multiple URLs? Thanks.
Answer 0 (score: 0)
One way to achieve this is to use the start_requests() method of scrapy.Spider instead of the start_urls attribute. You can see more here:
import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['dummymart.com']

    def start_requests(self):
        for page in range(1, 20):
            yield scrapy.Request(
                url='https://www.dummymart.net/product/auto-parts--118?page%s' % page,
                callback=self.parse,
            )
            yield scrapy.Request(
                url='https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page,
                callback=self.parse,
            )
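Both requests point at self.parse, which the snippet above never defines (Scrapy's base Spider.parse raises NotImplementedError). A minimal stub inside the class body would let the spider run end to end; the CSS selector here is purely a placeholder assumption, not from the original post:

    def parse(self, response):
        # Placeholder extraction: 'h2::text' is an assumed selector,
        # adjust it to the real page markup.
        for title in response.css('h2::text').getall():
            yield {'title': title}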
If you want to keep using the start_urls attribute, you can try the following (I haven't tested it):

start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1,20)] + ['https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1,20)]
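The same list can also be built with a single comprehension over both URL templates, which scales better if you add more sections later (again a sketch I haven't tested, reusing the URLs from the question):

start_urls = [
    base % page
    for base in ('https://www.dummymart.net/product/auto-parts--118?page%s',
                 'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s')
    for page in range(1, 20)
]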
Also note that in the allowed_domains attribute you only need to specify the domain. See here.
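For example, since the requested URLs live on dummymart.net (the spiders above mix in dummymart.com, which I assume is a typo in the original post), the attribute would look like:

allowed_domains = ['dummymart.net']  # just the domain: no scheme, no path

Scrapy's offsite filtering treats subdomains such as www.dummymart.net as part of dummymart.net, so listing the bare domain is enough.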