Question

I'm using Anemone to spider a domain, and it works fine.

The code to start the crawl looks like this:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

This nicely prints out all the page URLs for the domain, like so:

http://www.example.com/
http://www.example.com/about
http://www.example.com/articles
http://www.example.com/articles/article_01
http://www.example.com/contact

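What I'd like to do is create an array of key/value pairs, using the last part of the URL as the key and the URL "minus the domain" as the value.

E.g.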

[
   ['','/'],
   ['about','/about'],
   ['articles','/articles'],
   ['article_01','/articles/article_01']
]

Apologies if this is basic stuff, but I'm new to Ruby.

Answer 1

I would first define an array or hash outside the block, and then add your key/value pairs to it:

require 'anemone'

path_array = []
crawl_url = "http://www.example.com/"

Anemone.crawl(crawl_url) do |anemone|
  anemone.on_every_page do |page|
    # page.url is a URI object, so store it as a String for the slicing below
    path_array << page.url.to_s
    puts page.url
  end
end

From there you can map the array into a usable multi-dimensional array:

path_array.map { |x| [x[crawl_url.length..-1], x.gsub("http://www.example.com", "")] }

=> [["", "/"], ["about", "/about"], ["articles", "/articles"], ["articles/article_01", "/articles/article_01"], ["contact", "/contact"]]

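I'm not sure it will work in every case (note that the nested page gets the key "articles/article_01" rather than "article_01"), but I think it gives you a good start on how to collect the data and manipulate it. Also, if you want key/value pairs, you should look into Ruby's Hash class for more on how to create and use hashes.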
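As a rough illustration of the Hash suggestion, here is a minimal sketch that turns the same pairs into a hash with Array#to_h. The URL list is hard-coded sample data standing in for what the crawl would collect, and the names base and path_hash are only illustrative.

```ruby
crawl_url = "http://www.example.com/"

# Sample data standing in for the URLs gathered during the crawl (assumed).
path_array = [
  "http://www.example.com/",
  "http://www.example.com/about",
  "http://www.example.com/articles",
  "http://www.example.com/articles/article_01",
  "http://www.example.com/contact"
]

# Domain without the trailing slash, so the value keeps its leading "/".
base = crawl_url.chomp("/")

# Same mapping as above, then convert the [key, value] pairs into a Hash.
path_hash = path_array.map { |url|
  [url[crawl_url.length..-1], url.sub(base, "")]
}.to_h

puts path_hash["about"]
```

Keep in mind that a hash holds one value per key, so two pages that produce the same key would overwrite each other.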

Answer 2

The simplest, and possibly least robust, way to do this would be to use

page.url.to_s.split('/').last

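to get your 'key'. You would need to test it against various edge cases to make sure it works reliably.

Edit: this returns 'www.example.com' as the key for 'http://www.example.com/', which is not the result you need.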
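One possible way around that edge case is to split only the path portion of the URL rather than the whole string. The sketch below assumes Ruby's standard uri library and uses a hypothetical helper named last_segment; the URL list is sample data, not crawl output.

```ruby
require 'uri'

# Hypothetical helper: last path segment of a URL, or "" for the root page.
def last_segment(url)
  path = URI.parse(url.to_s).path   # e.g. "/articles/article_01"
  path.split('/').last.to_s         # "/".split('/') is empty, so nil becomes ""
end

# Sample data standing in for the crawled URLs (assumed).
urls = [
  "http://www.example.com/",
  "http://www.example.com/about",
  "http://www.example.com/articles",
  "http://www.example.com/articles/article_01",
  "http://www.example.com/contact"
]

# Pair each key with the URL's path, i.e. the URL minus the domain.
pairs = urls.map { |url| [last_segment(url), URI.parse(url).path] }

pairs.each { |key, path| puts "#{key.inspect} => #{path.inspect}" }
```

Inside the crawl block the same helper could be called with page.url, since it converts its argument with to_s first.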

Anemone Ruby spider - create key value array without domain name
