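The INJECT step only retrieves a single URL when I try to crawl CNN. I have the default configuration (from nutch-site). What could be the cause? Given my value, shouldn't I get 10 documents?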
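Answer 0 (score: 0)
A Nutch crawl consists of 4 basic steps: generate, fetch, parse, and update the database. The steps are the same for Nutch 1.x and Nutch 2.x. Running all four steps to completion makes up one crawl cycle.
Injection can be considered the preliminary step: it adds URLs to the crawldb.
To populate the initial rows of the webtable (Nutch 2.x), you can use InjectorJob.
I assume you have already provided the seed, i.e. cnn.com.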
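generate.max.count limits the number of URLs fetched from a single domain.
What matters now is how many cnn.com URLs your crawldb contains.
Option 1
You have generate.max.count = 10 and you have seeded (injected) more than 10 URLs into the crawl. In that case, when you run a crawl cycle, Nutch should fetch no more than 10 URLs from that domain.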
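To make Option 1 concrete, here is a minimal Python sketch of a per-host cap like the one generate.max.count applies. It is a toy model, not Nutch code: the function name `generate_fetch_list` and the parameter `max_count` are hypothetical. The real generator also orders candidates by score, and whether it counts per host or per domain depends on its configuration.

```python
from collections import defaultdict
from urllib.parse import urlparse


def generate_fetch_list(candidate_urls, max_count=10):
    """Toy model: keep at most max_count URLs per host, in input order."""
    per_host = defaultdict(int)
    fetch_list = []
    for url in candidate_urls:
        host = urlparse(url).netloc
        if per_host[host] < max_count:
            per_host[host] += 1
            fetch_list.append(url)
    return fetch_list


# 25 injected URLs on one host plus one URL on another host (made-up data).
seeds = [f"http://www.cnn.com/page{i}" for i in range(25)]
seeds.append("http://example.org/")

selected = generate_fetch_list(seeds, max_count=10)
cnn_selected = [u for u in selected if urlparse(u).netloc == "www.cnn.com"]
print(len(cnn_selected), len(selected))  # prints: 10 11
```

The cap applies per host, so the single example.org URL is still selected alongside the 10 cnn.com URLs.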
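Option 2
If you injected only one URL and ran only one crawl cycle, the first cycle gives you just one document, because your crawldb contains only one URL. The crawldb is updated at the end of every crawl cycle, so the outlinks found on that page become available afterwards. When you run the second, third, and later cycles, Nutch should fetch and parse at most 10 URLs from a given domain per cycle.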
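The reasoning in Option 2 can be sketched as a small simulation of the generate, fetch, parse, and updatedb loop. This is only an illustration under stated assumptions: the `crawl` and `outlinks` helpers are hypothetical, the link graph is invented (every page links to 30 new pages on the same host), and none of it calls Nutch itself.

```python
from urllib.parse import urlparse


def outlinks(url):
    """Hypothetical link graph: every page links to 30 new pages."""
    base = url.rstrip("/")
    return [f"{base}/{i}" for i in range(30)]


def crawl(seed, cycles, max_count=10):
    """Return how many URLs were fetched in each crawl cycle."""
    crawldb = {seed: "unfetched"}  # inject: the crawldb starts with one URL
    fetched_per_cycle = []
    for _ in range(cycles):
        # generate: select unfetched URLs, capped at max_count per host
        per_host = {}
        segment = []
        for url, status in crawldb.items():
            if status != "unfetched":
                continue
            host = urlparse(url).netloc
            if per_host.get(host, 0) < max_count:
                per_host[host] = per_host.get(host, 0) + 1
                segment.append(url)
        # fetch + parse: mark pages as fetched and collect their outlinks
        discovered = []
        for url in segment:
            crawldb[url] = "fetched"
            discovered.extend(outlinks(url))
        # updatedb: newly discovered URLs become candidates for the next cycle
        for url in discovered:
            crawldb.setdefault(url, "unfetched")
        fetched_per_cycle.append(len(segment))
    return fetched_per_cycle


result = crawl("http://www.cnn.com/", cycles=3)
print(result)  # prints: [1, 10, 10]
```

With a single seed, the first cycle can only fetch that one page. The per-host cap of 10 starts to matter from the second cycle, once the crawldb holds the discovered outlinks.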