Question

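I'm trying to build a site that scrapes various web pages hosted on .onion domains. That means it isn't as simple as calling requests.get("http://XXX.onion"), because .onion addresses can only be reached over a Tor connection.

I could use a redirector like onion.to, but it requires a click-through, which doesn't work when I'm scraping.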

I don't care about anonymity; I just want the data.

Answer 1

Out of the box, Requests supports HTTP proxies but not SOCKS proxies, and SOCKS is what Tor gives you.

You can get a patched version of Requests: How to make python Requests work via socks proxy
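As a side note to the patched-version route: more recent Requests releases (2.10 and later, as far as I know) can talk to SOCKS proxies directly when installed with the `requests[socks]` extra. Below is a minimal sketch, assuming Tor's SOCKS listener is on its default 127.0.0.1:9050. `tor_proxies` and `fetch_onion` are hypothetical helper names, not part of Requests. The `socks5h` scheme asks the proxy to resolve hostnames, which .onion addresses need.

```python
TOR_SOCKS_URL = "socks5h://127.0.0.1:9050"  # assumption: Tor's default SocksPort


def tor_proxies(proxy_url=TOR_SOCKS_URL):
    """Build the proxies mapping Requests expects, routing both schemes via Tor."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_onion(url, timeout=60):
    """Fetch a page through Tor and return its body as text."""
    # Imported here so that building the proxy mapping alone does not need Requests.
    import requests  # needs: pip install "requests[socks]"

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of returning an error page
    return response.text
```

Usage would then look like `html = fetch_onion("http://something.onion/")`, where the URL is a placeholder.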

Or install Polipo and use it as an additional proxy that "converts" Tor's SOCKS5 proxy into an HTTP/HTTPS proxy. Here is my config file:

proxyName = "localhost"
proxyAddress = "127.0.0.1"
proxyPort = 8118

allowedClients = 127.0.0.1
allowedPorts = 1-65535

cacheIsShared = false
chunkHighMark = 67108864

socksParentProxy = "localhost:9050"
socksProxyType = socks5


diskCacheRoot = ""
localDocumentRoot = ""

disableLocalInterface = true
disableConfiguration = true
disableVia = true

dnsUseGethostbyname = yes

maxConnectionAge = 5m
maxConnectionRequests = 120

serverMaxSlots = 8
serverSlots = 2

tunnelAllowedPorts = 1-65535
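Before pointing a scraper at this setup, it can help to confirm that both local listeners are actually up. The following is only a sketch: it assumes Tor is on 127.0.0.1:9050 and Polipo on 127.0.0.1:8118, as in the config above, and `is_port_open` and `check_proxy_chain` are hypothetical helper names. It checks that the ports accept TCP connections, not that traffic really flows through Tor.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_proxy_chain(host="127.0.0.1", tor_port=9050, polipo_port=8118):
    """Report which of the two local listeners are reachable."""
    return {
        "tor_socks": is_port_open(host, tor_port),
        "polipo_http": is_port_open(host, polipo_port),
    }
```

Calling `check_proxy_chain()` returns a dict with one boolean per listener, so a `False` entry points at the service that still needs to be started.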

Now you just pass the proxy to Requests:

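import requests

# Polipo listens on port 8118 and forwards everything to Tor's SOCKS5 proxy.
# Proxy URLs should include a scheme.
proxies = {
    'http': 'http://localhost:8118',
    'https': 'http://localhost:8118'
}

requests.get('http://something.onion/', proxies=proxies)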

Answer 2

Why not just set up Tor and use a combination of wget and torsocks?

e.g.

# torsocks wget -c --mirror http://kpvz7ki2v5agwt35.onion
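If the mirroring has to be driven from a Python scraper rather than a shell, the same command can be launched with `subprocess`. This is a rough sketch assuming `torsocks` and `wget` are both on the PATH. `build_mirror_command` and `mirror_onion_site` are hypothetical helper names, and the `dest_dir` parameter is an addition not present in the command above.

```python
import subprocess


def build_mirror_command(url, dest_dir="mirror"):
    """Return the argv list for mirroring url through torsocks and wget."""
    return [
        "torsocks", "wget",
        "--continue",                       # same as -c: resume partial downloads
        "--mirror",                         # recursive download with timestamping
        f"--directory-prefix={dest_dir}",   # assumption: store files under dest_dir
        url,
    ]


def mirror_onion_site(url, dest_dir="mirror"):
    """Run the mirror command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_mirror_command(url, dest_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems if the URLs come from scraped data.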

Easiest way to scrape sites with .onion domains?
