Authentication and connection refused errors when crawling with Nutch

Date: 2014-04-16 11:51:13

Tags: nutch

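I am trying to crawl some URLs with Nutch 1.7, but I am running into:

  1. authentication problems, and

  2. connection refused exceptions.

From the logs I can see that it tries to authenticate with NTLM, but then it reports "Redirect required" and finally releases the connection (see logpart-1).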

I followed the Nutch tutorial at

http://wiki.apache.org/nutch/HttpAuthenticationSchemes#A_note_on_NTLM_domains

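  1. I set the auth-configuration in the httpclient-auth.xml file.

  2. I enabled httpclient in the plugin.includes property (the one from nutch-default.xml) in nutch-site.xml:

      plugin.includes: protocol-(httpclient|http)|urlfilter-regex|parse-(text|html|tika)|index-(more|basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)

  3. I also defined the auth configuration file in nutch-site.xml:

      http.auth.file: httpclient-auth.xml (description: Authentication configuration file for the 'protocol-httpclient' plugin.)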

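But I have had no success!

Have I configured authentication incorrectly, or am I missing something?

Can anyone help me with the correct authentication configuration in Nutch?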
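
For readers following along, the two nutch-site.xml settings described in the question would normally be written as Hadoop-style property elements. The sketch below is an assumption about the layout, not the asker's actual file; the plugin.includes value is a shortened placeholder limited to plugin names that appear in the question, and the remaining values are copied from the question.

```xml
<?xml version="1.0"?>
<!-- Sketch of nutch-site.xml; the asker's real file was not posted. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Shortened placeholder: replace with the full plugin list you use.
         protocol-httpclient must be present for http.auth.file to matter. -->
    <value>protocol-httpclient|urlfilter-regex|indexer-solr|scoring-opic</value>
  </property>
  <property>
    <name>http.auth.file</name>
    <value>httpclient-auth.xml</value>
    <description>Authentication configuration file for the 'protocol-httpclient' plugin.</description>
  </property>
</configuration>
```

The file named in http.auth.file is expected to sit in the Nutch conf directory next to nutch-site.xml.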


Attaching the complete hadoop.log

logpart-1: authentication

      2014-04-16 05:11:23,712 DEBUG httpclient.HttpMethodDirector - Authorization required
      2014-04-16 05:11:23,712 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
      2014-04-16 05:11:23,731 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
      2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
      2014-04-16 05:11:23,732 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
      2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
      2014-04-16 05:11:23,733 DEBUG httpclient.HttpMethodDirector - Retry authentication
      2014-04-16 05:11:23,733 DEBUG fetcher.Fetcher - FetcherThread spin-waiting ...
      2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
      2014-04-16 05:11:23,733 DEBUG cookie.CookieSpec - Unrecognized cookie attribute: name=HttpOnly, value=null
      2014-04-16 05:11:23,734 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
      2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Cookie accepted: "PHPSESSID=9f9378mvh9e720f5o3l0ibc1o7"
      2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@sp.zzz.com:80
      2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodDirector - Redirect required
      2014-04-16 05:11:23,735 DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
      2014-04-16 05:11:23,735 DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
      2014-04-16 05:11:23,735 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
      2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=www.xxxportal.com]
      2014-04-16 05:11:23,736 DEBUG util.IdleConnectionHandler - Adding connection at: 1397643083736
      2014-04-16 05:11:23,736 DEBUG httpclient.MultiThreadedHttpConnectionManager - Notifying no-one, there are no waiting threads
      2014-04-16 05:11:23,737 DEBUG httpclient.HttpMethodBase - Adding Host request header
      2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authorization required
      2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
      2014-04-16 05:11:23,744 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
      2014-04-16 05:11:23,744 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@sp.zzz.com:80
      2014-04-16 05:11:23,745 INFO  regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
      2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials required
      2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodDirector - Credentials provider not available
      2014-04-16 05:11:23,745 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@sp.zzz.com:80
      2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
      2014-04-16 05:11:23,745 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
      2014-04-16 05:11:23,746 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
      2014-04-16 05:11:23,746 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=sp.zzz.com]
      

For a few other links I get:

I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect

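I am not behind any proxy, and I have also turned off all firewall settings on my system, so I do not know why I am getting the connection refused exception.

Here too, I cannot find the exact cause of the exception.

Please help me understand what exactly is going wrong in this case.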

      Attaching the complete hadoop.log

logPart2: connection refused

      2014-04-16 05:11:26,443 INFO  fetcher.Fetcher - * queue: www.xxxportal.com
      2014-04-16 05:11:26,443 INFO  fetcher.Fetcher -   maxThreads    = 1
      2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   inProgress    = 0
      2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   crawlDelay    = 5000
      2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   minCrawlDelay = 0
      2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   nextFetchTime = 1397643088739
      2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   now           = 1397643086444
      2014-04-16 05:11:26,444 INFO  fetcher.Fetcher -   0. www.xxxportal.com/profiles/
      2014-04-16 05:11:26,445 INFO  fetcher.Fetcher -   1. www.xxxportal.com/wiki/index.php
      2014-04-16 05:11:26,445 INFO  fetcher.Fetcher -   2. www.xxxportal.com/sop/
      2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Closing the connection.
      2014-04-16 05:11:26,560 INFO  httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect
      2014-04-16 05:11:26,560 DEBUG httpclient.HttpMethodDirector - Connection refused: connect
      java.net.ConnectException: Connection refused: connect
                      at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
                      at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
                      at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
                      at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
                      at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
                      at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
                      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
                      at java.net.Socket.connect(Socket.java:579)
                      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                      at java.lang.reflect.Method.invoke(Method.java:606)
                      at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:140)
                      at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
                      at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
                      at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
                      at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
                      at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
                      at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
                      at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
                      at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:94)
                      at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
                      at org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:75)
                      at org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:157)
                      at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:391)
                      at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:676)
      2014-04-16 05:11:26,564 INFO  httpclient.HttpMethodDirector - Retrying request
      2014-04-16 05:11:26,565 DEBUG httpclient.HttpConnection - Open connection to www.zzzlearninglounge.com:80
      

1 Answer:

Answer 0 (score: 1)

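1) From the logs it is clear that NTLM authentication is failing for your particular site.

Check three things, in this order: first the username and password, then the authentication scheme (Basic, NTLM, ...), and then the port you are authenticating against.

If all three are verified and set to the correct values, your authentication problem should be resolved.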
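
To make the three checks concrete, here is a minimal sketch of an httpclient-auth.xml in the format described on the Nutch wiki page linked in the question. The host and port are taken from the "Authentication scope: NTLM <any realm>@sp.zzz.com:80" log line. The username, password and EXAMPLEDOMAIN are placeholders. Carrying the Windows domain in the realm attribute is my reading of the wiki's note on NTLM domains, so verify it there. Note that the log shows the request to www.xxxportal.com ending up at sp.zzz.com, and that "Credentials provider not available" usually indicates that no configured credentials matched that scope, so the authscope probably needs to name sp.zzz.com rather than the original host.

```xml
<?xml version="1.0"?>
<!-- Sketch only: username, password and EXAMPLEDOMAIN are placeholders. -->
<auth-configuration>
  <credentials username="crawler" password="secret">
    <!-- host and port copied from the "Authentication scope" log line;
         realm is assumed to carry the NTLM domain (verify against the wiki). -->
    <authscope host="sp.zzz.com" port="80" realm="EXAMPLEDOMAIN" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

If the host, port or scheme in the authscope does not match what the server actually challenges with, the credentials are never offered, which matches the failure shown in logpart-1.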