Question

I am trying to crawl one specific website in its entirety (ignoring external links) using Nutch 2.3 with HBase 0.94.14.

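I followed a step-by-step tutorial (you can find it here) to learn how to set up and use these tools. However, I have not been able to reach my goal. Instead of crawling the entire website whose URL I put in the seed.txt file, Nutch only retrieves that base URL in the first round. I have to run further crawls for Nutch to retrieve more URLs.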

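The problem is that I don't know how many rounds are needed to crawl the whole site, so I need a way to tell Nutch "keep crawling until the whole site has been crawled" (in other words, "crawl the whole site in one go").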

Here are the key steps and settings I have used so far:

Put the base URL in the seed.txt file.

http://www.whads.com/

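Configure Nutch's nutch-site.xml file. After finishing the tutorial, I added a few more properties suggested in other StackOverflow questions (none of them solved my problem, though).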

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->
<configuration>
        <property>
            <name>http.agent.name</name>
            <value>test-crawler</value>
        </property>
        <property>
            <name>storage.data.store.class</name>
            <value>org.apache.gora.hbase.store.HBaseStore</value>
        </property>
        <property>
            <name>plugin.includes</name>
            <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
        </property>
        <property>
            <name>db.ignore.external.links</name>
            <value>true</value>
        </property>
        <property>
            <name>db.ignore.internal.links</name>
            <value>false</value>
        </property>
        <property>
            <name>fetcher.max.crawl.delay</name>
            <value>-1</value>
        </property>
        <property>
            <name>fetcher.threads.per.queue</name>
            <value>50</value>
            <description></description>
        </property>
        <property> 
            <name>generate.count.mode</name> 
            <value>host</value>
        </property>
        <property> 
            <name>generate.max.count</name> 
            <value>-1</value>
        </property>
</configuration>
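One detail of the configuration above can be checked mechanically. As far as I understand the plugin loader (treat this as an assumption, not something the thread confirms), Nutch compiles `plugin.includes` into a single regular expression and loads a plugin only when its whole id matches. The value above has no `|` between `urlnormalizer-(pass|regex|basic)` and `protocol-http`. The Python sketch below copies the value verbatim and tests a few plugin ids against it; the helper names are hypothetical, and Python's `re` stands in for Java's regex engine. Whether this has anything to do with the crawl stopping after the seed URL is not established here.

```python
import re

# plugin.includes value copied verbatim from the nutch-site.xml above.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|index-(basic|more)|"
    "query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|"
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|msexcel|msword|mspowerpoint|pdf)|"
    "summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)protocol-http|"
    "urlfilter-regex|parse-(html|tika|metatags)|"
    "index-(basic|anchor|more|metadata)"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Return True if the entire plugin id matches the includes regex.

    Assumption: Nutch applies a whole-string match per plugin id.
    """
    return re.fullmatch(includes, plugin_id) is not None


def report(plugin_ids, includes=PLUGIN_INCLUDES):
    """Map each plugin id to whether the includes regex selects it."""
    return {pid: is_included(pid, includes) for pid in plugin_ids}


if __name__ == "__main__":
    ids = ["protocol-httpclient", "protocol-http",
           "urlnormalizer-basic", "parse-tika"]
    for pid, selected in report(ids).items():
        print(f"{pid}: {'included' if selected else 'NOT included'}")
```

Under that assumption, the urlnormalizer plugins are not selected by this value; inserting the missing `|` would select them.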

Following suggestions from StackOverflow and the Nutch mailing list, I added an "accept anything else" rule to Nutch's regex-urlfilter.txt configuration file.

# Already tried these two filters (one at a time, 
# and each one combined with the 'anything else' one)
#+^http://www.whads.com
#+^http://([a-z0-9]*.)*whads.com/

# accept anything else
+.
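To make the effect of these rules concrete, here is a minimal Python sketch of how such a filter file is usually evaluated: rules are tried in order, the first pattern found anywhere in the URL decides (`+` accepts, `-` rejects), and a URL that matches no rule is rejected. This is a simplified model rather than Nutch's own code, and Python's `re` stands in for Java's regex engine. The `TIGHT` rule is a hypothetical stricter variant that is not in the original file; it is included only to show what the unescaped dots in the commented-out pattern allow.

```python
import re


def parse_rules(text):
    """Parse filter lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL with no matching rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# The rule that is active in the file above.
ACCEPT_ALL = parse_rules("+.")
# One of the commented-out rules, with its unescaped dots.
LOOSE = parse_rules("+^http://([a-z0-9]*.)*whads.com/")
# Hypothetical stricter variant (assumption, not from the original file).
TIGHT = parse_rules(r"+^https?://([a-z0-9-]+\.)*whads\.com/")

if __name__ == "__main__":
    for url in ["http://www.whads.com/about", "http://notwhads.com/"]:
        print(url, accepts(ACCEPT_ALL, url),
              accepts(LOOSE, url), accepts(TIGHT, url))
```

With `+.` as the only active rule, the filter itself rejects nothing, so restricting the crawl to one site is left entirely to `db.ignore.external.links`.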

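Crawling: I have tried two different approaches (both give the same result: only one URL is generated and fetched in the first round):
- Using bin/nutch (following the tutorial):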
```
bin/nutch inject urls
bin/nutch generate -topN 50000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
```
- Using bin/crawl:
```
bin/crawl urls whads 1
```

Am I missing something? Am I doing something wrong? Or is it simply not possible for Nutch to crawl a whole website in one go?

Thank you very much in advance!

Answer 1

Please update your configuration as follows:

    <property>
        <name>db.ignore.external.links</name>
        <value>false</value>
    </property>

As it stands, you are ignoring external links, which means external URLs are not crawled.

Answer 2

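After playing with Nutch for a few days and trying everything I could find on the Internet, I finally gave up. Someone told me that it is not possible to crawl an entire website in one go with Nutch. So, if anyone with the same problem comes across this question, do what I did: drop Nutch and use something like Scrapy (Python). You have to set up the spiders by hand, but it works like a charm, is more extensible and faster, and gives better results.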

Answer 3

Have you tried it with -1 at the end? I can see that you only used 1 at the end, which runs the crawl for a single round.
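The point about the round count can be sketched as follows. Each round runs generate, fetch, parse and updatedb; updatedb is the step that records newly discovered links, so only the next round's generate can select them, and a single round never gets past the seed URL. The Python sketch below only assembles the exact bin/nutch commands quoted in the question and repeats them for a chosen number of rounds. The function names and the round count are assumptions for illustration, and the sketch does not try to detect when the site is exhausted.

```python
import subprocess

# Per-round commands, copied from the question's bin/nutch sequence.
ROUND_STEPS = [
    ["bin/nutch", "generate", "-topN", "50000"],
    ["bin/nutch", "fetch", "-all"],
    ["bin/nutch", "parse", "-all"],
    ["bin/nutch", "updatedb", "-all"],
]


def plan_crawl(seed_dir, rounds):
    """Return the command list: one inject, then `rounds` full cycles."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    commands = [["bin/nutch", "inject", seed_dir]]
    for _ in range(rounds):
        commands.extend(list(step) for step in ROUND_STEPS)
    return commands


def run_crawl(seed_dir, rounds, runner=subprocess.check_call):
    """Run every planned command in order, stopping at the first failure.

    `runner` is injectable so the plan can be inspected without Nutch.
    """
    for command in plan_crawl(seed_dir, rounds):
        runner(command)


if __name__ == "__main__":
    # Dry run: print the commands for three rounds instead of executing them.
    run_crawl("urls", 3, runner=lambda cmd: print(" ".join(cmd)))
```

Running it from the Nutch runtime directory with the default runner would execute the commands; the dry run shown here only prints them.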

Why does Nutch (v2.3) only fetch the seed URL instead of crawling the whole website?
