Nutch Crawl Script

Date: 2015-09-09 19:27:41

Tags: solr cygwin nutch

I'm running Nutch 1.10 and having trouble with the crawl script provided by the Nutch developers:

Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
    -i|--index      Indexes crawl results into a configured indexer
    -D              A Java property to pass to Nutch calls
    Seed Dir        Directory in which to look for a seeds file
    Crawl Dir       Directory where the crawl/link/segments dirs are saved
    Num Rounds      The number of rounds to run this crawl for
 Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/  2

I was wondering if anyone could give me some insight into reading this. For example:

    -i|--index      **What is the configured indexer? Is it part of Nutch, or is it a separate program like Solr? When I put in -i, what am I doing?**
    -D              **Not sure how these get used in the crawl, but the instruction is pretty self-explanatory.**
    Seed Dir        **Self-explanatory, but where do I put the directory within Nutch? I created a urls directory (per the instructions) in the apache-nutch-1.10 directory. I've also tried putting it in apache-nutch-1.10/bin, because that is where the crawl starts from.**
    Crawl Dir       **Is this where the results of the crawl go, or is this where the data for injection into the crawldb goes? If it's the latter, where do I get said data? The directory starts out empty and never gets filled. Confusing!**
    Num Rounds      **Self-explanatory**
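
For the Seed Dir question above, the usual setup (a sketch; the directory and file names are arbitrary, `urls/` and `seed.txt` are just conventions) is a directory containing one or more plain-text files with one URL per line, created relative to wherever you invoke the script from:

```shell
# Create a seed directory (conventionally named "urls") in the
# directory you will run bin/crawl from, e.g. apache-nutch-1.10/.
mkdir -p urls

# A seeds file is plain text, one URL per line; "seed.txt" is a
# conventional name, not a required one.
echo "http://nutch.apache.org/" > urls/seed.txt
```

The crawl script is then given this directory as `<Seed Dir>`, as in the usage example: `bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2`.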

Other questions: Where do the results of the crawl end up? Do they have to go into a Solr core (or some other piece of software)? Can they go into a directory so I can look at them? What format do they come in?

Thanks!

1 answer:

Answer 0: (score: 1)

-i: The configured indexer is a separate program such as Solr or Elasticsearch. When you specify the -i option, the crawl script runs the indexing job; otherwise it skips it.

Crawl Dir: The directory where the crawl data is stored. This includes the crawldb, the segments, and the linkdb, so essentially all crawl-related data ends up here.
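
After a round or two completes, the crawl directory described above typically looks something like this (a sketch; the segment name is a timestamp and will differ on your machine):

```
TestCrawl/
├── crawldb/                # status of every URL Nutch knows about
├── linkdb/                 # inverted link graph (inlinks per URL)
└── segments/
    └── 20150909192741/     # one timestamped segment per fetch round
```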

The results of the crawl go into the crawlDir you specify. The data is stored as Hadoop sequence files, and there are commands to view it.
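
The commands to view the data are `bin/nutch` subcommands. For example (a sketch, run from the Nutch install root; the crawl-directory name and segment timestamp are placeholders from the examples above):

```shell
# Print summary statistics for the crawldb (URL counts by fetch status):
bin/nutch readdb TestCrawl/crawldb -stats

# Dump one segment's fetched content to a plain-text directory
# so you can read it with an ordinary editor:
bin/nutch readseg -dump TestCrawl/segments/20150909192741 dump_dir
```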

You can find them at https://wiki.apache.org/nutch/CommandLineOptions.