Apache Nutch crawler with MongoDB is not fetching the correct URLs

Date: 2017-05-23 23:41:54

Tags: nutch

I installed Apache Nutch on my CentOS 6.7 VM and configured it to store its output in MongoDB.

The problem is that it is not crawling the correct URLs, or rather not returning the correct URLs. Do you think this could be because of the website's security?

My conf/regex-urlfilter.txt has the following entry:

# accept anything else
+^http://*.*

seed.txt (for testing purposes only) contains:

[abc@X.X.X.X local]$ cat urls/seed.txt
http://www.sears.com/

The steps I followed were inject -> generate -> fetch -> parse -> updatedb.
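For reference, a full cycle on the command line would look roughly like the following (a sketch, assuming a Nutch 2.x build where parse and updatedb take -all to process every generated batch, the same way fetch does):

bin/nutch inject urls/
bin/nutch generate -topN 80
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all

The log for the first three steps: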

[abc@X.X.X.X local]$ bin/nutch inject urls/
InjectorJob: starting at 2017-05-23 18:26:08
InjectorJob: Injecting urlDir: urls
InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2017-05-23 18:26:11, elapsed: 00:00:02
[abc@X.X.X.X local]$ bin/nutch generate -topN 80
GeneratorJob: starting at 2017-05-23 18:26:17
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 80
GeneratorJob: finished at 2017-05-23 18:26:21, time elapsed: 00:00:03
GeneratorJob: generated batch id: 1495581977-876634391 containing 1 URLs
[abc@X.X.X.X local]$ bin/nutch fetch -all
FetcherJob: starting at 2017-05-23 18:26:32
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
fetching https://www.facebook.com/LinioEcuador/ (queue crawl delay=5000ms)
fetching https://www.clubpremier.com/mx/conocenos/niveles/ (queue crawl delay=5000ms)
fetching https://twitter.com/LinioEcuador/ (queue crawl delay=5000ms)
fetching https://www.instagram.com/clubpremier/ (queue crawl delay=5000ms)
fetching https://reservaciones.clubpremier.com/profiles/itineraries.cfm (queue crawl delay=5000ms)
fetching https://s3.amazonaws.com/club_premier/logo-cp.svg (queue crawl delay=5000ms)
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
QueueFeeder finished: total 49 records. Hit by time limit :0
fetching https://www.facebook.com/clubpremiermexico (queue crawl delay=5000ms)
fetching https://s3.amazonaws.com/club_premier/clubpremier-components-info/images/logo-cp.svg (queue crawl delay=5000ms)
fetching https://twitter.com/clubpremier_mx (queue crawl delay=1000ms)
10/10 spinwaiting/active, 4 pages, 0 errors, 0.8 1 pages/s, 1151 1151 kb/s, 40 URLs in 2 queues
fetching https://www.clubpremier.com/mx/acumula/compra/multiplica-puntos-premier (queue crawl delay=5000ms)
fetching https://reservaciones.clubpremier.com/travel/arc.cfm (queue crawl delay=5000ms)
10/10 spinwaiting/active, 6 pages, 0 errors, 0.6 0 pages/s, 798 445 kb/s, 38 URLs in 1 queues
fetching https://www.clubpremier.com/mx/acumula/compra/adquiere-puntos-premier/ (queue crawl delay=5000ms)
10/10 spinwaiting/active, 7 pages, 0 errors, 0.5 0 pages/s, 606 223 kb/s, 37 URLs in 1 queues
fetching https://www.clubpremier.com/mx/acumula/aerolineas/skyteam/ (queue crawl delay=5000ms)

As you can see above, the fetched URLs have nothing to do with the site I want to crawl. Please help me resolve this issue.

Thanks, Shilpa

1 Answer:

Answer 0 (score: 1)

It looks like the URL filter is configured to accept every page on the web. If the intent is to restrict the crawl to pages within the domain sears.com, the rules could look like this:

# allow pages in the domain sears.com
+^https?://([a-z0-9]+\.)*sears\.com
# skip anything else
-.*
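
To sanity-check the rules (RegexURLFilter applies them top-down and the first matching rule decides), here is a minimal sketch using plain java.util.regex against a few of the URLs from the fetch log above; the class name and sample URLs are just for illustration:

import java.util.regex.Pattern;

// Sketch: check which sample URLs the suggested '+' pattern matches.
// URLs matching it would be accepted; everything else falls through
// to the '-.*' rule and is dropped.
public class RuleCheck {
    public static void main(String[] args) {
        Pattern searsOnly = Pattern.compile("^https?://([a-z0-9]+\\.)*sears\\.com");
        String[] samples = {
            "http://www.sears.com/",                  // the seed URL
            "https://www.facebook.com/LinioEcuador/", // an unrelated outlink
            "https://twitter.com/clubpremier_mx"      // another unrelated outlink
        };
        for (String url : samples) {
            boolean accepted = searsOnly.matcher(url).find();
            System.out.println((accepted ? "+ " : "- ") + url);
        }
    }
}

Only the seed URL matches, so with these two rules the social-media outlinks seen in the fetch log would be filtered out.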

Also have a look at the following configuration properties:

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts or domain
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  See 'db.ignore.external.links.mode'.
  </description>
</property>

<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
  <description>Alternative value is byDomain</description>
</property>
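
If the goal is simply to stay on the injected domain, these defaults could be overridden in conf/nutch-site.xml instead of maintaining filter rules. A sketch (byDomain keeps subdomains such as www.sears.com in scope, while byHost would pin the crawl to the exact injected host):

<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>

<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>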