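I am trying to crawl some URLs with Nutch 1.7, but I am running into an authentication problem and a Connection refused exception.
From the log I can see that it tries to authenticate with NTLM, but it then reports "Redirect required" and eventually releases the connection (see logpart-1).
Following the Nutch tutorial at http://wiki.apache.org/nutch/HttpAuthenticationSchemes#A_note_on_NTLM_domains, I have set the auth-configuration in the httpclient-auth.xml file, and added httpclient to the plugin.includes property in nutch-site.xml and nutch-default.xml:
name: plugin.includes
value: protocol-(httpclient|http)|urlfilter-regex|parse-(text|html|tika)|index-(more|basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
Also in nutch-site.xml:
name: http.auth.file
value: httpclient-auth.xml
description: Authentication configuration file for the 'protocol-httpclient' plugin.
But I had no success!
Have I configured authentication incorrectly, or am I missing something?
Can anyone help me with the correct authentication configuration in Nutch?
Attaching the complete hadoop.log:
logpart-1: authentication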
2014-04-16 05:11:23,712 DEBUG httpclient.HttpMethodDirector - Authorization required
2014-04-16 05:11:23,712 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-04-16 05:11:23,731 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,732 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Retry authentication
2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
2014-04-16 05:11:23,733 DEBUG cookie.CookieSpec - Unrecognized cookie attribute: name=HttpOnly, value=null
2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Cookie accepted: "PHPSESSID=9f9378mvh9e720f5o3l0ibc1o7"
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Redirect required
2014-04-16 05:11:23,735 DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
2014-04-16 05:11:23,735 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=www.xxxportal.com]
2014-04-16 05:11:23,736 DEBUG util.IdleConnectionHandler - Adding connection at: 1397643083736
2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Notifying no-one, there are no waiting threads
2014-04-16 05:11:23,737 DEBUG httpclient.HttpMethodBase - Adding Host request header
2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authorization required
2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,745 INFO regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials required
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials provider not available
2014-04-16 05:11:23,745 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@sp.zzz.com:80
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
2014-04-16 05:11:23,746 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
2014-04-16 05:11:23,746 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=sp.zzz.com]
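For several other links I get:
I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
I am not behind any proxy, and I have even turned off all firewall settings on my system, so I don't know why I am getting the Connection refused exception.
I have not been able to find the exact cause of it either.
Please help me understand the exact problem in this case.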
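In the stack trace, the refusal is raised at the TCP connect step while Nutch is fetching robots.txt (HttpRobotRulesParser.getRobotRulesSet), so it can be reproduced outside Nutch. Below is a minimal sketch, assuming Python is available on the crawl machine; the function name `probe` and the 5 second timeout are illustrative choices, not part of Nutch.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try a plain TCP connect and return a short status label."""
    try:
        # create_connection resolves the name and connects, like the crawler's socket does
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on that port
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks and similar errors
        return "error: " + exc.__class__.__name__
```

Calling `probe(host, 80)` for each host named in the log shows whether the refusal comes from that host and port itself, independent of the Nutch configuration. A result of "refused" from this script as well would point at the target server or the network path, not at the authentication setup.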
logpart-2: connection refused
2014-04-16 05:11:26,443 INFO fetcher.Fetcher - * queue: www.xxxportal.com
2014-04-16 05:11:26,443 INFO fetcher.Fetcher - maxThreads = 1
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - inProgress = 0
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - crawlDelay = 5000
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - minCrawlDelay = 0
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - nextFetchTime = 1397643088739
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - now = 1397643086444
2014-04-16 05:11:26,444 INFO fetcher.Fetcher - 0. www.xxxportal.com/profiles/
2014-04-16 05:11:26,445 INFO fetcher.Fetcher - 1. www.xxxportal.com/wiki/index.php
2014-04-16 05:11:26,445 INFO fetcher.Fetcher - 2. www.xxxportal.com/sop/
2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Closing the connection.
2014-04-16 05:11:26,560 INFO httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Connection refused: connect
java.net.ConnectException: Connection refused: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:140)
at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:94)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
at org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:75)
at org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:157)
at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:391)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:676)
2014-04-16 05:11:26,564 INFO httpclient.HttpMethodDirector - Retrying request
2014-04-16 05:11:26,565 DEBUG httpclient.HttpConnection - Open connection to www.zzzlearninglounge.com:80
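Answer 0 (score: 1)
1) It is clear from the log that NTLM authentication is failing on your particular site.
Here you should first check the username/password,
then the auth scheme (Basic, NTLM, etc.), and then the port you are authenticating against.
If you verify these 3 points and use the correct values, your authentication problem should be resolved...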
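The three checks above can be sketched as a small script that parses an auth file and reports which credentials would cover the scope shown in the log (NTLM, sp.zzz.com, port 80). This is only a sketch: it assumes the `<credentials>` / `<authscope>` layout described on the Nutch wiki page linked in the question, and the username, password and realm in `SAMPLE` are hypothetical placeholders. Treating a missing attribute as "matches anything" is also an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical httpclient-auth.xml content; all values are placeholders
SAMPLE = """
<auth-configuration>
  <credentials username="crawler" password="secret">
    <authscope host="sp.zzz.com" port="80" realm="EXAMPLEDOMAIN" scheme="NTLM"/>
  </credentials>
</auth-configuration>
"""


def matching_credentials(xml_text, host, port, scheme):
    """Return usernames whose <authscope> covers the given host, port and scheme."""
    root = ET.fromstring(xml_text.strip())
    matches = []
    for cred in root.iter("credentials"):
        for scope in cred.iter("authscope"):
            # A missing attribute is treated as a wildcard (assumption of this sketch)
            host_ok = scope.get("host") in (None, host)
            port_ok = scope.get("port") in (None, str(port))
            scheme_ok = (scope.get("scheme") or scheme).lower() == scheme.lower()
            if host_ok and port_ok and scheme_ok:
                matches.append(cred.get("username"))
    return matches
```

For example, `matching_credentials(SAMPLE, "sp.zzz.com", 80, "ntlm")` returns `["crawler"]`, while the same call with port 8080 or scheme "basic" returns an empty list. An empty result for the host in the log would be consistent with the "Credentials provider not available" line, although the script does not check whether the username and password themselves are valid.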