Nutch NoHttpResponseException in HttpBase.getRobotRules while fetching

Time: 2016-11-15 13:30:52

Tags: java wordpress web-crawler nutch

I am new to Nutch and Linux. I am working on a legacy Java project that uses Nutch 2.1 to crawl a website. The site's blog is hosted at https://iconewsblog.wordpress.com, but I cannot crawl the blog.

My seed list contains: https://ico.org.uk

I have set db.ignore.external.links to false in nutch-site.xml, as shown below:

<property>
 <name>db.ignore.external.links</name>
 <value>false</value>
 <description>If true, outlinks leading from a page to external hosts
 will be ignored. This will limit your crawl to the host on your seeds file.
 </description>
</property>

I have also added the blog site to my regex-urlfilter.txt.
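To sanity-check filter rules of this kind outside Nutch, a minimal Java sketch like the one below may help. It assumes the usual regex-urlfilter.txt semantics (each rule is `+` or `-` followed by a regular expression, the first rule that matches anywhere in the URL decides, and a URL that matches no rule is dropped). The class name `UrlFilterSketch` and the three patterns are hypothetical illustrations, not the actual contents of my file, and this is not Nutch's own implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of first-match-wins URL filtering.
public class UrlFilterSketch {

    // One rule: accept is true for a '+' line and false for a '-' line.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Parses rule lines, skipping blank lines and '#' comments.
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', trimmed.substring(1)));
        }
        return rules;
    }

    // The first rule whose pattern is found in the URL decides; no match means reject.
    static boolean accepts(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed example patterns; replace them with the lines from the real file.
        List<Rule> rules = parse(
            "# allow the seed host and the blog host, reject everything else",
            "+^https?://ico\\.org\\.uk/",
            "+^https?://iconewsblog\\.wordpress\\.com/",
            "-.");

        String blogUrl = "https://iconewsblog.wordpress.com/2015/07/17/"
            + "new-opportunities-and-increased-accountability-public-sector-reuse-rules-come-into-force/";
        System.out.println("blog URL accepted: " + accepts(rules, blogUrl));
        System.out.println("other URL accepted: " + accepts(rules, "https://example.com/"));
    }
}
```

Rule order matters under these semantics: a broad `-` rule placed above the `+` line for the blog host would reject the blog URLs before the `+` rule is ever consulted.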

However, crawling the blog still fails, with the following stack trace:

2016-11-14 12:36:41,992 INFO  fetcher.FetcherJob - fetching https://iconewsblog.wordpress.com/2015/07/17/new-opportunities-and-increased-accountability-public-sector-reuse-rules-come-into-force/
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.version = HTTP/1.0
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.unambiguous-statusline = false
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.single-cookie-header = false
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.strict-transfer-encoding = false
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.reject-head-body = false
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.warn-extra-input = false
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.status-line-garbage-limit = 2147483647
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.content-charset = UTF-8
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.cookie-policy = compatibility
2016-11-14 12:36:41,992 DEBUG params.DefaultHttpParams - Set parameter http.protocol.single-cookie-header = true
2016-11-14 12:36:41,992 DEBUG httpclient.MultiThreadedHttpConnectionManager - HttpConnectionManager.getConnection:  config = HostConfiguration[host=https://iconewsblog.wordpress.com], timeout = 10000
2016-11-14 12:36:41,992 DEBUG httpclient.MultiThreadedHttpConnectionManager - Reclaiming connection, hostConfig=HostConfiguration[host=https://ico.org.uk]
2016-11-14 12:36:41,992 DEBUG httpclient.MultiThreadedHttpConnectionManager - Allocating new connection, hostConfig=HostConfiguration[host=https://iconewsblog.wordpress.com]
2016-11-14 12:36:41,992 DEBUG httpclient.HttpConnection - Open connection to iconewsblog.wordpress.com:443
2016-11-14 12:36:42,036 DEBUG httpclient.HttpMethodBase - Adding Host request header
2016-11-14 12:36:42,088 DEBUG httpclient.HttpMethodDirector - Closing the connection.
2016-11-14 12:36:42,088 INFO  httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server iconewsblog.wordpress.com failed to respond
2016-11-14 12:36:42,088 DEBUG httpclient.HttpMethodDirector - The server iconewsblog.wordpress.com failed to respond
org.apache.commons.httpclient.NoHttpResponseException: The server iconewsblog.wordpress.com failed to respond
    at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1976)
    at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:95)
    at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:172)
    at org.apache.nutch.protocol.http.api.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:458)
    at org.apache.nutch.protocol.http.api.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:443)
    at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:396)
    at org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:491)
2016-11-14 12:36:42,088 INFO  httpclient.HttpMethodDirector - Retrying request

In case it helps: my httpclient-auth.xml is empty. This setup was done by a developer who no longer works with my company.

Any help would be greatly appreciated. Thanks.

0 answers