Question

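I want to scrape a shop with Ruby, Nokogiri, and Mechanize.

On a page that shows two articles, I know that every article address starts with .../p/..., which is why I collect those links in article_links. All /p/ links should be printed.

Normally I would expect to see two addresses: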

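require 'mechanize'

agent = Mechanize.new
# URL must be a string; the http:// scheme is an assumption (Mechanize needs an absolute URL)
page = agent.get('http://exampleshop.com')

# every link whose href contains /p/
article_links = page.links_with(href: %r{.*/p/})

article_links.map do |link|
    article = link.click
    target_url = page.uri + link.uri # full URL
    puts "#{target_url}"
end
# crawling stuff on /p/ pages not included here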

However, every link ends up duplicated, and the duplication is already there before the loop runs, so I see:

exampleshop.com/p/productxy.html

exampleshop.com/p/productxy.html

exampleshop.com/p/productab.html

exampleshop.com/p/productab.html

I believe each product has two hrefs containing /p/ in the site's markup. Is there a good way to prevent this? Or is it possible to use Nokogiri CSS in links_with?

Answer 1

You can remove the duplicates before iterating over the list:

Instead of

article_links.map do |link|

write

article_links.uniq { |link| link.uri }.map do |link|

This removes any link whose uri duplicates an earlier one.
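
As a minimal sketch of how uniq with a block behaves, the snippet below uses a hypothetical FakeLink struct as a stand-in for a Mechanize link (only a uri reader is modelled), so it runs without hitting a real site. The sample paths are taken from the question; everything else is an assumption for illustration.

```ruby
require 'uri'

# FakeLink is a hypothetical stand-in for a Mechanize link object;
# it only models the #uri reader used by the uniq block.
FakeLink = Struct.new(:text, :uri)

article_links = [
  FakeLink.new('image', URI('/p/productxy.html')),
  FakeLink.new('title', URI('/p/productxy.html')),
  FakeLink.new('image', URI('/p/productab.html')),
  FakeLink.new('title', URI('/p/productab.html'))
]

# uniq with a block keeps the first element for each distinct block value,
# so the second link to the same product is dropped.
unique_links = article_links.uniq { |link| link.uri }

unique_paths = unique_links.map { |link| link.uri.to_s }
puts unique_paths
```

Because the first occurrence is kept, the order of the remaining links matches their order on the page.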

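You can use a CSS attribute selector (often called a "regex" selector, although it is a plain substring match) instead of links_with, but you still need to remove the duplicates in Ruby: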

article_links = page.css("a[href*='/p/']")

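The reason you still need to remove the duplicates in Ruby is that CSS has no way to select only the first element among those matching a selector. nth-of-type and nth-child count an element's position among its siblings, not among the matches, so they do not help here.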
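
A CSS query returns Nokogiri nodes rather than Mechanize links, so one way to deduplicate them is to work on the href strings. The sketch below assumes the hrefs have already been pulled out of the nodes (for example with something like `node['href']`); the hrefs array and the http:// base URL are made-up sample values, not output from the real site.

```ruby
require 'uri'

# Assumed base URL; the scheme is a guess, the question only gives the host.
base_url = 'http://exampleshop.com/'

# Sample href values as they might appear on the page, two per product.
hrefs = [
  '/p/productxy.html',
  '/p/productxy.html',
  '/p/productab.html',
  '/p/productab.html'
]

# Drop repeated hrefs first, then resolve each one against the base URL.
product_urls = hrefs.uniq.map { |href| URI.join(base_url, href).to_s }

puts product_urls
```

The resulting absolute URLs can then be fetched one by one, for instance with agent.get.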

How to avoid duplicate entries when scraping a website
