Question

I am following the Apache Nutch tutorial.

As the tutorial shows, I have set the last line of regex-urlfilter.txt to:

+^http://([a-z0-9]*\.)*nutch.apache.org/

My nutch-site.xml file contains only the following:

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

My seed.txt file is:

http://nutch.apache.org/

However, when I run the crawl with

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

I get a "No URLs to fetch" error. Does anyone know why?

Answer 1

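The configuration looks fine. Did you make these changes in the runtime/local folder? seed.txt should be in the NUTCH_HOME/runtime/local/urls folder, and regex-urlfilter.txt and nutch-site.xml should be in the NUTCH_HOME/runtime/local/conf folder.

NUTCH_HOME is the installation directory.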
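One quick way to confirm the files are where the crawl expects them is a small shell check like the sketch below. It assumes the directory layout described above; `check_nutch_files` is a hypothetical helper written for this answer, not part of Nutch, and it only tests that each file exists and is non-empty.

```shell
#!/bin/sh
# Sketch: report whether each file sits under runtime/local as described above.
# check_nutch_files is a hypothetical helper, not a Nutch command.
# Argument 1: the Nutch installation directory (NUTCH_HOME).
check_nutch_files() {
  nutch_local="$1/runtime/local"
  rc=0
  for f in urls/seed.txt conf/regex-urlfilter.txt conf/nutch-site.xml; do
    # -s is true only if the file exists and is not empty
    if [ -s "$nutch_local/$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, assuming NUTCH_HOME is set to the installation directory:
# check_nutch_files "$NUTCH_HOME"
```

If any file is reported as missing or empty, the edits were probably made in a different copy of the conf or urls folder than the one the crawl command reads.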
