Question

The following sentence in the Wget manual caught my attention:

wget --spider --force-html -i bookmarks.html

This feature needs much more work for Wget to get close to the functionality of real web spiders.

I found the following lines of code related to the spider option in the wget source.

src/ftp.c
780:      /* If we're in spider mode, don't really retrieve anything.  The
784:      if (opt.spider)
889:  if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227:      if (!opt.spider)
1239:      if (!opt.spider)
1268:      else if (!opt.spider)
1827:          if (opt.htmlify && !opt.spider)

src/http.c
64:#include "spider.h"
2405:  /* Skip preliminary HEAD request if we're not in spider mode AND
2407:  if (!opt.spider
2428:      if (opt.spider && !got_head)
2456:      /* Default document type is empty.  However, if spider mode is
2570:           * spider mode.  */
2571:          else if (opt.spider)
2661:              if (opt.spider)

src/res.c
543:  int saved_sp_val = opt.spider;
548:  opt.spider       = false;
551:  opt.spider       = saved_sp_val;  

src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)

src/spider.h
1:/* Declarations for spider.c

src/recur.c
52:#include "spider.h"
279:      if (opt.spider)
366:              || opt.spider /* opt.recursive is implicitely true */
370:             (otherwise unneeded because of --spider or rejected by -R) 
375:                   (opt.spider ? "--spider" : 
378:                     (opt.delete_after || opt.spider
440:      if (opt.spider) 

src/options.h
62:  bool spider;           /* Is Wget in spider mode? */

src/init.c
238:  { "spider",           &opt.spider,            cmd_boolean },

src/main.c
56:#include "spider.h"
238:    { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435:       --spider                  don't download anything.\n"),
1045:  if (opt.recursive && opt.spider)

I would like to see the differences in code, not in the abstract. I love code examples.

How do web spiders differ from Wget's spider in code?

Answer 1

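I'm not sure exactly what the original author of the comment meant, but my guess is that wget is slow as a spider, since it appears to use only a single thread of execution (at least judging by what you have shown).

"Real" spiders such as heritrix use a lot of parallelism and tricks to optimize their crawl speed while still being nice to the sites they are crawling. This typically means limiting hits to any one site to about 1 per second, and crawling multiple sites at the same time.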
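
To make the politeness idea concrete, here is a minimal sketch of a per-host scheduler in Python. It is not taken from heritrix or wget; the class name `PoliteScheduler`, its methods and the 1-second default are assumptions for illustration. The clock is passed in as a plain number so the logic can be followed without any network access.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical frontier: each host is contacted at most once per `delay` seconds."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.pending = {}    # host -> deque of URLs waiting to be fetched
        self.next_time = {}  # host -> earliest time the host may be hit again
        self.heap = []       # (earliest allowed time, host), only hosts with pending URLs

    def add(self, url):
        host = urlsplit(url).netloc
        queue = self.pending.setdefault(host, deque())
        if not queue:
            # Host has nothing queued, so it is not in the heap yet.
            heapq.heappush(self.heap, (self.next_time.get(host, 0.0), host))
        queue.append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, else None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        queue = self.pending[host]
        url = queue.popleft()
        self.next_time[host] = now + self.delay
        if queue:
            heapq.heappush(self.heap, (self.next_time[host], host))
        return url
```

A crawler built on this would have several workers call `next_url` (behind a lock) and sleep briefly when it returns `None`, so many hosts are fetched in parallel while each single host only sees about one request per second.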

Again, this is just a guess based on what I know about spiders in general and on what you posted here.

Answer 2

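Unfortunately, many of the better-known "real" web spiders are closed source, and in fact closed binary. However, there are a number of basic techniques that wget is missing:

Parallelism: you will never be able to keep up with the entire web without retrieving multiple pages at a time.
Prioritization: some pages are more important to spider than others.
Rate limiting: you will be banned quickly if you keep pulling down pages as fast as you can.
Saving to something other than a local filesystem: the Web is big enough that it will not fit in a single directory tree.
Rechecking pages periodically without restarting the entire process: in practice, with a real spider you will want to recheck "important" pages for updates frequently, while less interesting pages can go for months.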
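
As a rough illustration of the prioritization and rechecking points, the sketch below keeps URLs in a heap ordered by when they are next due, and gives "important" pages a shorter revisit interval. The name `RecrawlFrontier` and the intervals used are assumptions made up for this example, not something wget or any particular spider provides.

```python
import heapq


class RecrawlFrontier:
    """Hypothetical frontier ordered by due time; important pages come due more often."""

    def __init__(self):
        self.heap = []      # entries are (due_time, url)
        self.interval = {}  # url -> seconds between visits

    def add(self, url, interval, now=0.0):
        """Register a URL once; it is due immediately and then every `interval` seconds."""
        if url not in self.interval:
            self.interval[url] = interval
            heapq.heappush(self.heap, (now, url))

    def pop_due(self, now):
        """Return the most overdue URL and reschedule it, or None if nothing is due."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, url = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (now + self.interval[url], url))
        return url
```

The crawl never "finishes" and never restarts here: a front page added with an interval of one hour keeps coming back, while an archive page added with an interval of a month is left alone in between.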

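There are also various other inputs that can be used, such as sitemaps. The point is that wget is not designed to spider the entire web, and this is not something that can be captured in a small code sample: the problem lies in the overall technique being used, not in any single small subroutine being wrong for the task.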
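
For the sitemap input specifically, a small sketch is possible. The function below reads the `loc` and `lastmod` fields of a sitemap document using the namespace defined by the sitemaps.org protocol; the function name `parse_sitemap` is an assumption, and sitemap index files and compressed sitemaps are deliberately not handled.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol, version 0.9.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of (loc, lastmod or None) pairs from a sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append((loc.strip(), lastmod.strip() if lastmod else None))
    return entries
```

A spider could feed these pairs straight into its frontier and use `lastmod` as a hint for how soon a page deserves another visit.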

Answer 3

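I'm not going to go into detail about how to spider the whole internet. I think the wget comment is about spidering a single website, which is still a serious challenge.

As a spider you need to figure out when to stop, rather than descending into an endless recursive crawl just because the URL changed from date=1/1/1900 to date=1/2/1900, and so on.
Sorting out URL rewriting is an even bigger challenge (I have no idea how Google or anyone else handles it). Crawling enough but not too much is a pretty big challenge, and how can you automatically recognize rewritten URLs when there are random parameters and random changes in the content?
You need to parse Flash / JavaScript at least to some level.
You need to consider some awkward HTML and HTTP issues, such as the base tag. Even parsing HTML is not easy, since most websites are not XHTML and browsers are very forgiving about syntax.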
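
Two of the points above are small enough to sketch in Python. The first class extracts links with the standard library's tolerant `html.parser` and resolves them against a base tag when one is present. The second is one possible heuristic for the date=1/1/1900 kind of trap: it caps how many URLs with the same host, path and parameter names are accepted. Both `LinkExtractor` and `TrapGuard` are hypothetical names, and the heuristic is an assumption of mine, not how wget, httrack or Google handle it.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects absolute link targets, honouring the first <base href> on the page."""

    def __init__(self, page_url):
        super().__init__()
        self.base = page_url
        self.base_seen = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and not self.base_seen:
            self.base = urljoin(self.base, href)
            self.base_seen = True
        elif tag == "a":
            self.links.append(urljoin(self.base, href))


def url_shape(url):
    """Reduce a URL to host, path and sorted parameter names (values are dropped)."""
    parts = urlsplit(url)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return (parts.netloc.lower(), parts.path, tuple(names))


class TrapGuard:
    """Refuses URLs once `limit` URLs with the same shape have been accepted."""

    def __init__(self, limit=50):
        self.limit = limit
        self.seen = Counter()

    def allow(self, url):
        shape = url_shape(url)
        if self.seen[shape] >= self.limit:
            return False
        self.seen[shape] += 1
        return True
```

A cap like this will also cut off legitimate sites that keep all their content behind one script with varying parameters, which is exactly the "enough but not too much" trade-off described above. Scripted links and rewritten URLs that differ only in the path are not addressed at all.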

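I don't know how many of these are implemented or considered in wget, but you might want to look at httrack to understand the challenges of this task.

I would love to give you some code examples, but these are big tasks, and a decent spider will be around 5000 lines of code without third-party libraries.

+ Some of these have already been explained by @yaakov-belch, so I won't type them again.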
