I have a very simple question about Scrapy. I want to scrape a website starting from www.example.com/1 as the start_url. Then I want to visit www.example.com/2, www.example.com/3, and so on. I know this should be simple, but how do I do it?
Here is my scraper; it couldn't be simpler:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "scraper"
    start_urls = [
        'http://www.example.com/1',
    ]

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }
Now, how do I get to http://www.example.com/2?
Answer 0 (score: 2)
Add a start_requests method to your class and yield the requests there as needed:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "scraper"

    def start_requests(self):
        n = ???  # set the limit here
        for i in range(1, n):
            yield scrapy.Request('http://www.example.com/{}'.format(i), self.parse)

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }
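If the total number of pages isn't known up front, a variation (a sketch of mine, not part of the answer above) is to keep requesting the next page from parse until the content runs out. This assumes pages are numbered consecutively and that a page past the last one either returns a 404 (which Scrapy drops before it reaches parse, ending the chain) or contains no #Ficha element:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ['http://www.example.com/1']

    def parse(self, response):
        fichas = response.css('#Ficha')
        if not fichas:
            return  # empty page: assume we went past the last one
        for quote in fichas:
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }
        # derive the next page number from the current URL (assumes a /<n> suffix)
        page = int(response.url.rstrip('/').rsplit('/', 1)[-1])
        yield scrapy.Request(
            'http://www.example.com/{}'.format(page + 1),
            callback=self.parse,
        )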
Another option: you can put multiple URLs in the start_urls attribute:
class QuotesSpider(scrapy.Spider):
    name = "scraper"
    start_urls = ['http://www.example.com/{}'.format(i) for i in range(1, 100)]  # choose your limit here

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }
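Both variants come down to the same thing: Scrapy's default start_requests simply turns each entry of start_urls into a request whose responses go to parse. Roughly, and simplified, the default behaviour looks like this:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://www.example.com/1']

    # a simplified sketch of what scrapy.Spider.start_requests does by default
    def start_requests(self):
        for url in self.start_urls:
            # dont_filter=True: duplicate start URLs are not filtered out
            yield scrapy.Request(url, dont_filter=True)

Overriding start_requests, as in the first snippet, just replaces that default loop with your own.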
Answer 1 (score: 2)
Try this:
import scrapy
from scrapy.http import Request

class QuotesSpider(scrapy.Spider):
    name = "scraper"
    number_of_pages = 10  # number of pages you want to parse
    start_urls = [
        'http://www.example.com/1',
    ]

    def start_requests(self):
        # pages are numbered from 1, so begin the range there
        for i in range(1, self.number_of_pages + 1):
            yield Request('http://www.example.com/%d' % i, callback=self.parse)

    def parse(self, response):
        for quote in response.css('#Ficha'):
            yield {
                'item_1': quote.css('div.ficha_med > div > h1').extract(),
            }
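One detail worth noting: since start_requests is overridden here, Scrapy never consults start_urls, so that list is redundant and could be removed.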