Question

How can I crawl authentication-based pages using Nutch? I have completed all the required settings in nutch-site.xml, nutch-default.xml, and httpclient-auth.xml. It still shows the following:

Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.

I followed these links: link 1, link 2. But my crawler still cannot crawl the pages. Is there a way to use an API key to help with crawling?

Answer 1

You need to configure httpclient-auth.xml. Here is an example for the Solr UI with authentication; the same approach applies to your website.

httpclient-auth.xml

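Look through the multiple examples in this file and try them, and add the corresponding setting to nutch-site.xml. A credentials entry for httpclient-auth.xml looks like this: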

<auth-configuration>
   <credentials username="solr" password="xxx">
      <authscope host="localhost" port="8983"/>
   </credentials>
</auth-configuration>
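
For httpclient-auth.xml to be read at all, the protocol-httpclient plugin has to be enabled in nutch-site.xml in place of protocol-http. A minimal sketch of that property follows. Everything in the plugin list other than protocol-httpclient is only an assumed example, so keep whatever other plugins your own configuration already lists and change only the protocol plugin.

<property>
   <name>plugin.includes</name>
   <!-- Assumption: the plugins after protocol-httpclient are illustrative; copy your existing list and swap protocol-http for protocol-httpclient. -->
   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

This property goes inside the <configuration> element of nutch-site.xml. The host and port in the authscope element must match the URLs in your seed list, otherwise the credentials are not sent.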
