I'm a beginner, so please forgive the question.
I have a URL of the form http://example.com/news?count=XX, and I want Scrapy to iterate over all counts (1, 2, 3, 4, 5, ...) until it reaches an empty page (no HTML) or a 404 page.
The total number of pages is unknown, so I'm not sure how to tell Scrapy to work like this:
http://example.com/news?count=1 ===> found data, save it
http://example.com/news?count=2 ===> found data, save it
http://example.com/news?count=3 ===> found data, save it
....
....
....
http://example.com/news?count=X ===> no data found, stop here.
Answer 0 (score: 0)
Just write the spider like this:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/news?count=1"]
    count = 1

    def parse(self, response):
        # ... make your magic! ...
        self.count += 1
        # Rebuild the URL from the count instead of patching the last
        # character of response.url, so counts above 9 work too.
        next_url = "http://example.com/news?count=%d" % self.count
        yield scrapy.Request(next_url, callback=self.parse)
Note that computing next_url as response.url[:-1] + str(self.count) would only replace the last character of the URL, so that logic breaks once count > 9; build next_url from the count itself instead. Also, by default Scrapy's HttpErrorMiddleware drops non-2xx responses, so parse is never called for a 404 page and the chain of requests stops there on its own.
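The stop-on-empty-or-404 logic itself can be sketched independently of Scrapy. Below is a minimal framework-agnostic sketch; the names `crawl_counts` and `fetch` are hypothetical helpers (not part of Scrapy or the original answer), and `fetch(url)` is assumed to return the page body, or None/empty on a 404 or empty page:

```python
def crawl_counts(fetch, base="http://example.com/news?count=%d", max_pages=10000):
    """Hypothetical helper: increment count until fetch() signals no data.

    fetch(url) should return the page body as a string, or None (or an
    empty string) when the page is empty or returns 404.
    """
    results = []
    for count in range(1, max_pages + 1):
        body = fetch(base % count)
        if not body:              # empty page or 404 => stop here
            break
        results.append(body)      # "found data, save it"
    return results

# Usage with a fake fetcher that has data only for counts 1..5:
fake_fetch = lambda url: "data" if int(url.rsplit("=", 1)[1]) <= 5 else None
pages = crawl_counts(fake_fetch)  # collects 5 pages, then stops at count=6
```

In the real spider, the same check would live at the top of parse (return early when the response body is empty), with scrapy.Request playing the role of the loop.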