Question

I am building a simple Wikipedia page crawler that writes its results to a remote server running Redis.

 1. The crawler asks the server for a page that needs crawling.
 2. The crawler loads the page and adds the pages it finds to an internal buffer.
 3. When the page has finished being parsed, the results are sent to the server.
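
For illustration, the three steps above could be sketched in Python roughly as follows. This is only a sketch under assumptions: `next_page`, `fetch` and `submit` are hypothetical callables standing in for the server request, the HTTP download and the upload of results, and links are taken from plain `<a href>` tags using the standard library parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags into an internal buffer."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_once(next_page, fetch, submit):
    """Run one iteration of the three steps; return False when no work is left.

    next_page, fetch and submit are hypothetical callables (assumptions,
    not part of any real crawler or Redis API).
    """
    url = next_page()            # step 1: ask the server for a page
    if url is None:
        return False
    parser = LinkCollector()
    parser.feed(fetch(url))      # step 2: load the page, buffer the links found
    parser.close()
    submit(url, parser.links)    # step 3: send the results to the server
    return True
```

Passing the three operations in as callables keeps the loop independent of how the server is actually reached, so the same function works with stubs or with a real client.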

What I am trying to do:

Keep all the pages found on the server, with a flag that indicates whether each page has been crawled or not.

e.g

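My question is:

How can I get Redis to give me the first link with state 0 (not yet crawled), and then how do I tell Redis to change that state to 1 (after I have crawled it)?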

Answer 1

You can use a list to hold the pages that still need to be processed:

RPUSH mylist "http:// ...."

Then you can use LPOP to get the first item in the list:

LPOP mylist

To keep track of the pages that have already been processed, you can use a set:

SADD myset "http://....."

Finally, to check whether an address is already in the processed set:

SISMEMBER myset "http://...."
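
Put together, the list and the set could be used from a crawler roughly like this. This is a minimal sketch, not a definitive implementation: `InMemoryRedis` is a hypothetical stand-in so the example runs without a server, and `enqueue`, `next_uncrawled` and `mark_crawled` are illustrative helper names. With the redis-py client, a `redis.Redis(...)` object provides `rpush`, `lpop`, `sadd` and `sismember` methods that could be used in its place (note that redis-py returns bytes unless `decode_responses=True` is set).

```python
class InMemoryRedis:
    """Hypothetical stand-in for a Redis client, covering only four commands."""

    def __init__(self):
        self.lists = {}
        self.sets = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])

    def lpop(self, key):
        items = self.lists.get(key)
        return items.pop(0) if items else None

    def sadd(self, key, value):
        members = self.sets.setdefault(key, set())
        if value in members:
            return 0
        members.add(value)
        return 1

    def sismember(self, key, value):
        return value in self.sets.get(key, set())


def enqueue(r, url):
    """Queue a URL unless it is already in the processed set."""
    if not r.sismember("myset", url):
        r.rpush("mylist", url)


def next_uncrawled(r):
    """Pop URLs until one is found that has not been processed; None if empty."""
    while True:
        url = r.lpop("mylist")
        if url is None or not r.sismember("myset", url):
            return url


def mark_crawled(r, url):
    """Record the URL as processed (state 1 in the question's terms)."""
    r.sadd("myset", url)
```

In this layout the "state 0" pages are simply the ones still in the list, and "state 1" pages are the members of the set, so no per-page flag has to be updated. Because LPOP removes the item it returns, two crawlers asking at the same time will not be handed the same URL.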

Using redis nosql with webcrawler
