Question

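The site I want to index is fairly big, one million pages. I really just want a json file of all the URLs so I can run some operations on them (sorting, grouping, etc.).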

The basic Anemone loop worked well:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

But (because of the site size?) the terminal froze after a while. So I installed MongoDB and used the following:

require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

$stdout = File.new('sitemap.json','w')

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end

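It's running now, but I'll be very, very surprised if there's output in the json file when I get back in the morning - I've never used MongoDB before, and the part of the Anemone docs about using storage wasn't clear (to me at least). Can anyone who has done this before give me some tips?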
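
For reference, note that redirecting $stdout and calling puts produces a plain list of URLs, one per line, rather than JSON. A minimal sketch of one alternative is below: it writes one small JSON object per line (the "JSON Lines" layout) and syncs each write, so an interrupted crawl still leaves usable output. UrlSink and the file name sitemap.jsonl are hypothetical names made up for this illustration and are not part of Anemone; the commented-out crawl assumes the anemone gem and is not executed here.

```ruby
require 'json'

# Hypothetical helper (not part of Anemone): appends one JSON object per
# line so that a partially finished crawl still leaves readable output.
class UrlSink
  def initialize(path)
    @file = File.open(path, 'w')
    @file.sync = true # push every line to the OS instead of buffering it
  end

  # Accepts a String or a URI and writes it as {"url": "..."}.
  def <<(url)
    @file.puts(JSON.generate('url' => url.to_s))
    self
  end

  def close
    @file.close
  end

  # Read the file back into an array of URL strings.
  def self.load(path)
    File.foreach(path).map { |line| JSON.parse(line)['url'] }
  end
end

sink = UrlSink.new('sitemap.jsonl')

# With the anemone gem installed, the crawl from the question could feed it:
#   Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
#     anemone.storage = Anemone::Storage.MongoDB
#     anemone.on_every_page { |page| sink << page.url }
#   end

# Stand-ins for page.url so the sketch runs without a crawl:
sink << 'http://www.example.com/' << 'http://www.example.com/about'
sink.close

puts UrlSink.load('sitemap.jsonl').sort
```

Loading the file back with UrlSink.load gives an ordinary array, so sorting and grouping can happen after the crawl rather than during it.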

Answer 1

If anyone out there needs <= 100,000 URLs, the Ruby gem Spidr is a great way to go.

Answer 2

This is probably not the answer you wanted to see, but I strongly advise against using Anemone and Ruby to crawl a million pages.

Anemone is not a maintained library and fails on many edge cases.

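Ruby is not the fastest language and uses a global interpreter lock, which means that you can't have true threading capabilities. I think your crawl will probably be too slow. For more information about threading, I suggest you look at the following links.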

http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/

Does ruby have real multithreading?
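
As a rough illustration of what the lock does and does not prevent in standard Ruby (MRI): only one thread runs Ruby code at a time, but a thread that is blocked waiting (on I/O, or on sleep as here) releases the lock, so the waits can overlap. The sketch below uses a hypothetical fake_fetch that sleeps instead of making a real HTTP request, and a hypothetical fetch_all worker pool; neither comes from Anemone. Parsing and other CPU-bound work would still be serialized.

```ruby
require 'thread'

# Hypothetical stand-in for an HTTP request: sleeps to simulate latency.
def fake_fetch(url, delay = 0.05)
  sleep(delay)
  "body of #{url}"
end

# Fetch every URL using a fixed number of worker threads.
# Returns an array of [url, body] pairs in completion order.
def fetch_all(urls, worker_count)
  queue = Queue.new
  urls.each { |url| queue << url }
  results = Queue.new

  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [url, fake_fetch(url)]
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

urls = (1..8).map { |i| "http://www.example.com/page/#{i}" }

[1, 4].each do |count|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  pages = fetch_all(urls, count)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts "#{count} worker(s): #{pages.size} pages in #{elapsed.round(2)}s"
end
```

How much this helps a real crawl depends on how much of the time goes to waiting on the network versus processing pages in Ruby.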

You could try running Anemone on Rubinius or JRuby, which are much faster, but I'm not sure about the extent of compatibility.

I had some success moving from Anemone to Nutch, but your mileage may vary.

Getting all URLs using the anemone gem (very large site)
