Apache Nutch fetch does not work for https URLs

Time: 2019-06-10 13:52:49

Tags: solr web-crawler nutch nutch2

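I have configured Apache Nutch with HBase and Solr, and it works well for http URLs. I now need to crawl https URLs. After some searching, I found that I need to enable protocol-httpclient, so I updated both my nutch-default.xml and nutch-site.xml to include protocol-httpclient. Running bin/nutch fetch -all then produces the following output: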

FetcherJob: starting at 2019-06-10 09:44:14
FetcherJob: fetching all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 2 records. Hit by time limit :0
fetching https://some-url.gov (queue crawl delay=5000ms)
fetching http://some-url.gov (queue crawl delay=5000ms)
-finishing thread FetcherThread2, activeThreads=2
-finishing thread FetcherThread3, activeThreads=2
-finishing thread FetcherThread4, activeThreads=2
-finishing thread FetcherThread5, activeThreads=2
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread8, activeThreads=2
Fetcher: throughput threshold: -1
-finishing thread FetcherThread9, activeThreads=2
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
0/0 spinwaiting/active, 2 pages, 0 errors, 0.4 0 pages/s, 185 185 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread0, activeThreads=0
-finishing thread FetcherThread1, activeThreads=0
-finishing thread FetcherThread2, activeThreads=0
-finishing thread FetcherThread3, activeThreads=0
-finishing thread FetcherThread4, activeThreads=0
-finishing thread FetcherThread5, activeThreads=0
-finishing thread FetcherThread6, activeThreads=0
-finishing thread FetcherThread7, activeThreads=0
-finishing thread FetcherThread8, activeThreads=0
Fetcher: throughput threshold: -1
-finishing thread FetcherThread9, activeThreads=0
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
FetcherJob: finished at 2019-06-10 09:44:26, time elapsed: 00:00:12

Please advise. Thanks for your help.
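
For readers following along: the question does not show the edited configuration, so the fragment below is only a sketch of what enabling protocol-httpclient in nutch-site.xml typically looks like. The `plugin.includes` property is the standard Nutch setting for selecting plugins, but the rest of the plugin list here is an assumption based on a common Nutch 2.x setup with Solr indexing, not the asker's actual value.

```xml
<!-- nutch-site.xml: hypothetical sketch, not the asker's actual configuration -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- protocol-httpclient takes the place of protocol-http;
         every other plugin in this list is an assumed, typical selection -->
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
  </property>
</configuration>
```

Values in nutch-site.xml override those in nutch-default.xml, so the change normally only needs to be made in nutch-site.xml.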

0 answers:

No answers