Question

I followed the tutorial on the Nutch Wiki, "SetupNutchAndTor" (https://wiki.apache.org/nutch/SetupNutchAndTor), and set the following in nutch-site.xml:

  <property>
        <name>http.proxy.host</name>
        <value>127.0.0.1</value>
        <description>The proxy hostname.  If empty, no proxy is used.
        </description>
  </property>

    <property>
        <name>http.proxy.port</name>
        <value>8118</value>
        <description>The proxy port.</description>
    </property>

But Nutch still does not fetch anything from .onion links, and nothing gets indexed into Solr. Does anyone know what the problem is?

Answer 1

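Is there anything in the logs?

FYI, with StormCrawler you can use a SOCKS proxy directly, thanks to this commit.

You need to use OkHttp for the protocol implementation and configure it as follows: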

http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"

http.proxy.host: localhost
http.proxy.port: 9050
http.proxy.type: "SOCKS"
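
For context, here is a minimal sketch of what a SOCKS proxy setting like the one above corresponds to on the Java side, using only the standard `java.net.Proxy` class. It is not StormCrawler's actual code; the class name `SocksProxySketch` and the helper `socksProxy` are made up for illustration, and it assumes Tor is listening on its default SOCKS port 9050 on localhost (as opposed to port 8118 in the question, which is Privoxy's default HTTP proxy port).

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

public class SocksProxySketch {

    // Hypothetical helper: builds a SOCKS proxy descriptor for the given host and port.
    // No connection is opened here; the object only describes where the proxy lives.
    static Proxy socksProxy(String host, int port) {
        return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress(host, port));
    }

    public static void main(String[] args) {
        // Assumption: Tor's SOCKS listener is on localhost:9050, matching the config above.
        Proxy proxy = socksProxy("localhost", 9050);
        System.out.println("Proxy type: " + proxy.type());
        System.out.println("Proxy address: " + proxy.address());
    }
}
```

An HTTP client that accepts a `java.net.Proxy` can then be handed this object so that its connections go through the SOCKS proxy rather than an HTTP proxy.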

Nutch 2.3.1 on crawling the Deep Web
