Question

I need to scrape a website that basically has links like this:

www.website.com/link/page_1.html
www.website.com/link/page_2.html
www.website.com/link/page_3.html
...

The scraped content will be piped straight into a database.

It is easy enough to tell Django:

if item exists do not insert it, otherwise insert it
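
As a rough illustration of that "insert only if missing" rule, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Django's ORM (in Django itself, `Model.objects.get_or_create()` plays a similar role). The table name `items`, the `url` and `title` columns, and the helper `insert_if_new` are assumptions made up for this example.

```python
import sqlite3


def insert_if_new(conn, url, title):
    """Insert a row unless one with the same url exists; return True if inserted."""
    cur = conn.execute(
        # The PRIMARY KEY on url makes SQLite skip rows that already exist.
        "INSERT OR IGNORE INTO items (url, title) VALUES (?, ?)",
        (url, title),
    )
    conn.commit()
    return cur.rowcount == 1


# In-memory database used only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT PRIMARY KEY, title TEXT)")
insert_if_new(conn, "www.website.com/item/1", "first item")
```

This keeps the database free of duplicates, but every page still has to be fetched before the check can run.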

But is there a way to scrape only the links that were added since the last scrape?

For example, after website.com inserts new items:

/link/page_1.html becomes /link/page_2.html
new items populate /link/page_1.html

At that point, how do I tell Scrapy to scrape only the new items added since the last scrape?
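
One possible reading of this requirement, sketched in plain Python without Scrapy: because new items push older ones toward higher page numbers, a crawler could walk the pages from page_1 onward and stop at the first item it has already stored. The names `collect_new_items`, `fetch_page`, and `seen_ids` are hypothetical, and the sketch assumes that items are listed newest first and that each item has a stable identifier.

```python
def collect_new_items(fetch_page, seen_ids, max_pages=1000):
    """Walk listing pages from page 1 and return the items not seen before.

    fetch_page(n) is assumed to return the item ids on page n, newest first,
    or an empty list when the page does not exist.
    """
    new_items = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # ran past the last page
        for item_id in items:
            if item_id in seen_ids:
                # Everything from here on is older and was scraped earlier.
                return new_items
            new_items.append(item_id)
    return new_items
```

In a Scrapy spider, the same idea would amount to not yielding the request for the next page once a known item shows up.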

Answer 1

Recent versions of Scrapy support serializing requests to disk [1], and there is also Rolando's Redis integration [2].
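
For the first option, Scrapy's job persistence is switched on through the `JOBDIR` setting, which keeps the pending request queue and the fingerprints of already-seen requests in a directory between runs. A minimal sketch of the settings entry follows; the directory name is only an example. Since this works by skipping URLs that were already requested, it probably helps more with item detail pages than with listing pages whose URLs get reused.

```python
# settings.py (fragment)
# Directory where Scrapy stores the scheduler queue and the seen-request
# fingerprints between runs; the path below is an arbitrary example.
JOBDIR = "crawls/website-1"
```

The same setting can also be passed on the command line with `-s JOBDIR=...` when starting the crawl.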