Question

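I have a list of one million URLs to fetch. I use this list as the Nutch seed and fetch the URLs with Nutch's basic crawl command. However, I found that Nutch automatically fetches URLs that are not in the list. I set the crawl parameters to -depth 1 -topN 1000000, but that did not work. Does anyone know how to do this?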

Answer 1

Set this property in nutch-site.xml. (It defaults to true, which is why updatedb adds outlinks to the crawldb.)

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
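If you run the crawl steps by hand rather than through the one-shot crawl command, Nutch 1.x's updatedb also lists a -noAdditions flag in its usage output, which should give the same effect for a single run without touching the config. A minimal sketch, assuming the launcher lives at bin/nutch; the function name and the paths in the usage line are made up for illustration, so check the usage message of your own Nutch version first:

```shell
# NUTCH points at the Nutch launcher script; bin/nutch is an assumed default.
NUTCH="${NUTCH:-bin/nutch}"

# update_without_additions CRAWLDB SEGMENT   (hypothetical helper name)
# Merges one fetched segment into the CrawlDb, updating existing entries
# only, so newly discovered outlinks are not added.
update_without_additions() {
  "$NUTCH" updatedb "$1" "$2" -noAdditions
}

# Example call (paths are placeholders):
#   update_without_additions crawl/crawldb crawl/segments/<segment-name>
```

The property above is the more permanent fix, since it applies to every updatedb run, including the one inside the crawl command.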

Answer 2

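Delete the crawl and URL directories (if they were created earlier).
Create or update the seed file (with the URLs listed one per line).
Restart the crawl process.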

Command

nutch crawl urllist -dir crawl -depth 3 -topN 1000000

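urllist - the directory that contains the seed file (the list of URLs)
crawl - the name of the directory where the crawl data is stored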
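The clean-up and seed-file steps above can be sketched as a small shell function. This is only an illustration: the function name, the seed.txt file name, and the all_urls.txt input in the usage line are assumptions, not something Nutch requires.

```shell
# prepare_seed_dir SRC_FILE SEED_DIR CRAWL_DIR   (hypothetical helper name)
# SRC_FILE must live outside SEED_DIR and CRAWL_DIR, because both
# directories are deleted before the seed file is rebuilt.
prepare_seed_dir() {
  src_file="$1"
  seed_dir="$2"
  crawl_dir="$3"

  # Step 1: remove leftovers from earlier runs.
  rm -rf "$seed_dir" "$crawl_dir"

  # Step 2: rebuild the seed file with one URL per line,
  # dropping blank lines and duplicate entries.
  mkdir -p "$seed_dir"
  grep -v '^[[:space:]]*$' "$src_file" | sort -u > "$seed_dir/seed.txt"
}

# Example call, followed by the crawl command from the answer:
#   prepare_seed_dir all_urls.txt urllist crawl
```

Deduplicating up front is optional, but it keeps the seed file count equal to the number of distinct URLs you expect to see in the crawldb.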

If the problem still persists, try deleting your Nutch folder and restarting the whole process.

Crawling a specified list of URLs with Nutch
