Web crawling with Nutch

Date: 2017-04-02 12:14:37

Tags: solr nutch

I am using older versions of Nutch (1.4) and Solr (3.4.0) because I ran into installation problems with the later releases. After installing, I ran a crawl, and now I want to dump the crawled URLs to a text file. These are the options available in Nutch 1.4:

Abhijeet@Abhijeet /home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Expert: -core option is for developers only. It avoids building the job jar,
        instead it simply includes classes compiled with ant compile-core.
        NOTE: this works only for jobs executed in 'local' mode

There are two candidate options: readdb and readlinkdb. Which of the two do I need to run? The usage for each command is shown below.

For readdb:

Abhijeet@Abhijeet /home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch readdb
    cygpath: can't convert empty path
    Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
            <crawldb>       directory name where crawldb is located
            -stats [-sort]  print overall statistics to System.out
                    [-sort] list status sorted by host
            -dump <out_dir> [-format normal|csv ]   dump the whole db to a text file in <out_dir>
                    [-format csv]   dump in Csv format
                    [-format normal]        dump in standard format (default option)
            -url <url>      print information on <url> to System.out
            -topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>
                    [<min>] skip records with scores below this value.
                            This can significantly improve performance.
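Going by the usage text above, a readdb dump invocation could look like the following. This is a sketch, not a verified command line: it assumes the crawl output lives in a directory named myCrawl2 (the name used later in this question), so the crawldb sits at myCrawl2/crawldb, and the output directory name CrawlDump is arbitrary.

```shell
# Dump the whole crawldb as text into myCrawl2/CrawlDump
# (Hadoop writes part-* files into that directory, which should not already exist)
./nutch readdb myCrawl2/crawldb -dump myCrawl2/CrawlDump

# The same dump in CSV format, one record per line, which can be
# easier to post-process if only the URLs are wanted
./nutch readdb myCrawl2/crawldb -dump myCrawl2/CrawlDumpCsv -format csv
```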

For readlinkdb:

Abhijeet@Abhijeet /home/apache-nutch-1.4-bin/runtime/local/bin
$ ./nutch readlinkdb
cygpath: can't convert empty path
Usage: LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
        -dump <out_dir> dump whole link db to a text file in <out_dir>
        -url <url>      print information about <url> to System.out

I am confused about how to use these two commands correctly. An example would be a great help.

Edit:

I managed to run the readdb option and got the following output:

http://www.espncricinfo.com/    Version: 7
Status: 2 (db_fetched)
Fetch time: Sat Apr 15 20:40:38 IST 2017
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0042857
Signature: b7324a43f084e5b291ec56ccfb552a2a
Metadata: _pst_: success(1), lastModified=0

http://www.espncricinfo.com/afghanistan-v-ireland-2016-17/content/series/1040469.html   Version: 7
Status: 2 (db_fetched)
Fetch time: Sat Apr 15 20:43:03 IST 2017
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.080714285
Signature: f3bf66dc7c6cd440ee01819b29149140
Metadata: _pst_: success(1), lastModified=0

http://www.espncricinfo.com/afghanistan-v-ireland-2016-17/engine/match/1040485.html Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Mar 16 20:43:51 IST 2017
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0014285715
Signature: null
Metadata:
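Since the original goal was a plain list of crawled URLs, the full records above can be filtered down: each record header is a URL followed by "Version: 7", so keeping the first token of any line containing "Version:" yields just the URLs. A minimal sketch (the helper name extract_urls is mine, not part of Nutch):

```python
def extract_urls(dump_text):
    """Extract URLs from the text produced by `nutch readdb ... -dump`.

    Record headers look like "<url>    Version: 7", so the URL is the
    first whitespace-separated token on any line mentioning "Version:".
    """
    urls = []
    for line in dump_text.splitlines():
        if "Version:" in line:
            urls.append(line.split()[0])
    return urls

# Two abbreviated records in the dump format shown above
sample = """http://www.espncricinfo.com/    Version: 7
Status: 2 (db_fetched)

http://www.espncricinfo.com/afghanistan-v-ireland-2016-17/engine/match/1040485.html Version: 7
Status: 1 (db_unfetched)
"""

print(extract_urls(sample))
```

The same filtering could be done with `grep "Version:" | cut -f1` on the dump file directly.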

On the other hand, running the readlinkdb option dumps an empty file. Any ideas about what might be going wrong?

This is my readlinkdb command: ./nutch readlinkdb myCrawl2/linkdb -dump myCrawl2/LinkDump

0 Answers