I'm a beginner, so please forgive the question.
I have a URL of the form http://example.com/news?count=XX, and I want Scrapy to iterate over all counts (1, 2, 3, 4, 5, ...) until it reaches an empty page (no HTML) or a 404 page.
The total number of pages is unknown, so I'm not sure how to tell Scrapy to work like this:
http://example.com/news?count=1 ===> found data, save it
http://example.com/news?count=2 ===> found data, save it
http://example.com/news?count=3 ===> found data, save it
....
....
....
http://example.com/news?count=X ===> no data found, stop here.
Answer 0 (score: 0)
Just write the spider like this:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/news?count=1"]
    count = 1

    def parse(self, response):
        # ... make your magic! ...
        self.count += 1
        # Rebuild the URL from the count instead of patching the last
        # character of response.url, so counts above 9 work too.
        next_url = "http://example.com/news?count=%d" % self.count
        yield scrapy.Request(next_url, callback=self.parse)
Note that computing next_url as response.url[:-1] + str(self.count) would only replace the last character of the URL, so that logic breaks once count > 9; build next_url from the count itself instead. Also, by default Scrapy's HttpErrorMiddleware drops non-2xx responses, so parse is never called for a 404 page and the chain of requests stops there on its own.
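The stop-on-empty-or-404 logic itself can be sketched independently of Scrapy. Below is a minimal framework-agnostic sketch; the names `crawl_counts` and `fetch` are hypothetical helpers (not part of Scrapy or the original answer), and `fetch(url)` is assumed to return the page body, or None/empty on a 404 or empty page:

```python
def crawl_counts(fetch, base="http://example.com/news?count=%d", max_pages=10000):
    """Hypothetical helper: increment count until fetch() signals no data.

    fetch(url) should return the page body as a string, or None (or an
    empty string) when the page is empty or returns 404.
    """
    results = []
    for count in range(1, max_pages + 1):
        body = fetch(base % count)
        if not body:              # empty page or 404 => stop here
            break
        results.append(body)      # "found data, save it"
    return results

# Usage with a fake fetcher that has data only for counts 1..5:
fake_fetch = lambda url: "data" if int(url.rsplit("=", 1)[1]) <= 5 else None
pages = crawl_counts(fake_fetch)  # collects 5 pages, then stops at count=6
```

In the real spider, the same check would live at the top of parse (return early when the response body is empty), with scrapy.Request playing the role of the loop.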