apache nutch crawler - only a single URL is fetched

Date: 2016-05-20 18:13:26

Tags: apache web-crawler nutch

The inject step leaves only a single URL to fetch - I am trying to crawl CNN. I have the default configuration (nutch-site.xml below) - what could the cause be - shouldn't that be 10 documents given my setting?

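A minimal sketch of the relevant override in conf/nutch-site.xml, assuming generate.max.count (discussed in the answer below) is the setting in question:

    <!-- conf/nutch-site.xml: properties here override nutch-default.xml -->
    <configuration>
      <property>
        <!-- maximum number of URLs per host/domain in a single fetchlist -->
        <name>generate.max.count</name>
        <value>10</value>
      </property>
    </configuration>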

1 Answer:

Answer 0: (score: 0)

A Nutch crawl consists of four basic steps: generate, fetch, parse, and updatedb. The steps are the same in Nutch 1.x and Nutch 2.x. Executing and completing all four steps constitutes one crawl cycle.
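
In Nutch 1.x these four steps map directly onto shell commands. A minimal sketch of one cycle, assuming seeds live in urls/ and crawl state under crawl/ (both placeholder paths):

    # seed the crawldb (first run only)
    bin/nutch inject crawl/crawldb urls
    # generate a fetchlist, then pick up the segment it created
    bin/nutch generate crawl/crawldb crawl/segments
    seg=$(ls -d crawl/segments/2* | tail -1)
    bin/nutch fetch "$seg"
    bin/nutch parse "$seg"
    # fold fetched pages and newly discovered links back into the crawldb
    bin/nutch updatedb crawl/crawldb "$seg"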

The Injector is the first step: it adds URLs to the crawldb, as described here and here.

To populate the initial rows of the webtable, you may use the InjectorJob.

I think you have already done this, i.e. provided cnn.com as the seed.
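
On Nutch 2.x the injection step is a single command run against a directory of seed files; a sketch assuming urls/seed.txt holds the CNN seed:

    echo "http://www.cnn.com/" > urls/seed.txt
    bin/nutch inject urls/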

The generate.max.count property limits the number of URLs fetched from a single domain, as described here.

Now what matters is how many URLs from your crawl target cnn.com are in the crawldb.
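
With Nutch 1.x you can verify that count directly; readdb prints the total number of URLs in the crawldb and a breakdown by fetch status:

    bin/nutch readdb crawl/crawldb -stats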

Option 1

If you have generate.max.count = 10 and you have seeded or injected more than 10 URLs into the crawl, then when a crawl cycle is executed, Nutch should fetch no more than 10 of those URLs.
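
The limit can also be set per run instead of in nutch-site.xml; a sketch assuming the generate job accepts Hadoop-style -D overrides, as ToolRunner-based Nutch jobs do:

    # cap the fetchlist at 10 URLs per host/domain for this run only
    bin/nutch generate -D generate.max.count=10 crawl/crawldb crawl/segments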

Option 2

If you injected only one URL and ran only one crawl cycle, then that first cycle will yield only one document, because your crawldb contains just one URL. The crawldb is updated at the end of every crawl cycle, so on the second, third, and later cycles Nutch should fetch at most 10 URLs from that particular domain per cycle.
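
Rather than scripting the loop by hand, recent Nutch 1.x releases ship a bin/crawl wrapper that repeats the generate/fetch/parse/updatedb cycle; a sketch that runs three rounds:

    # <seed dir> <crawl dir> <number of rounds>
    bin/crawl urls/ crawl/ 3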