Scrapy: paginating with an unknown total, stopping when a 404 is reached

Time: 2016-05-16 19:07:15

Tags: scrapy scrapy-spider

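I'm new to this, so please forgive the question.

I have a URL (http://example.com/news?count=XX) and I want Scrapy to iterate over every count (1, 2, 3, 4, 5, ...) until it reaches an empty page (no HTML) or a 404 page.

The problem is that the total number of pages is unknown, so I'm not sure how to tell Scrapy to work like this: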

http://example.com/news?count=1 ===> found data, save it
http://example.com/news?count=2 ===> found data, save it
http://example.com/news?count=3 ===> found data, save it
....
....
....
http://example.com/news?count=X ===> no data found, stop here.

1 answer:

Answer 0 (score: 0):

Just write the spider like this:

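import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/news?count=1"]
    count = 1

    def parse(self, response):
        # Stop on an empty page (no HTML in the body).
        if not response.body.strip():
            return
        # ... make your magic! ...
        self.count += 1
        # Build the URL from the counter instead of slicing response.url.
        next_url = "http://example.com/news?count=%d" % self.count
        yield scrapy.Request(next_url, callback=self.parse)

The next URL is built from the counter rather than by trimming the last character of response.url, so the logic keeps working once count goes past 9. A 404 ends the crawl on its own: by default Scrapy does not pass non-2xx responses to the callback, so no further request is scheduled. The empty-body check covers a page that comes back with no HTML.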
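
If you would rather derive the next URL from the current one, here is a minimal sketch that increments the query parameter with the standard library's urllib.parse instead of cutting characters off the string. The helper name next_page_url and its param argument are my own, not part of Scrapy, and treating a missing parameter as 0 is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url, param="count"):
    """Return url with the integer query parameter `param` increased by one.

    Hypothetical helper: a missing parameter is treated as 0 (an assumption).
    """
    parts = urlsplit(url)
    # parse_qs maps each parameter name to a list of its values.
    query = parse_qs(parts.query, keep_blank_values=True)
    current = int(query.get(param, ["0"])[0])
    query[param] = [str(current + 1)]
    # doseq=True writes list values back as repeated key=value pairs.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(next_page_url("http://example.com/news?count=9"))
print(next_page_url("http://example.com/news?count=10"))
```

Inside parse, this would be used as scrapy.Request(next_page_url(response.url), callback=self.parse), which works for any number of digits and leaves other query parameters untouched.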