Getting HTML content from an onion link using the requests library (Python 3)

Time: 2018-10-10 19:10:01

Tags: python web-scraping

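I am trying to fetch the HTML of an onion site using the requests library (or urllib.request). I have tried several approaches, but none of them seem to work.

First, I simply tried to connect to the proxy with the requests library and fetch the HTML of Facebook's onion site: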

import requests

session = requests.session()
session.proxie = {}
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'

r = requests.get('https://facebookcorewwwi.onion/')

print(r.text)

However, when I do this, the connection to the proxy does not work (my IP stays the same with or without the proxy).

I get the following error:

raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='facebookcorewwwi.onion', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x109e8b198>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
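
A note on the snippet above: the proxies are configured on `session`, but the request is sent with the module-level `requests.get`, which does not read `session.proxies`. That would explain why the IP does not change and why the .onion name fails to resolve. The following is only a minimal sketch of the session-based variant. It assumes a Tor client is listening on localhost:9050 and that SOCKS support for requests is installed (`pip install requests[socks]`). The helper names `tor_proxies` and `fetch_via_tor` are hypothetical and used for illustration only.

```python
def tor_proxies(host="localhost", port=9050):
    """Build a requests-style proxies mapping for a local Tor SOCKS port.

    The host and port defaults are assumptions; adjust them to your setup.
    """
    # socks5h (rather than socks5) asks the proxy to resolve hostnames,
    # which .onion addresses require.
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


def fetch_via_tor(url, timeout=60):
    """Fetch a URL through the Tor proxy and return the response body."""
    # Imported here so tor_proxies() stays usable without requests installed.
    import requests

    session = requests.Session()
    session.proxies.update(tor_proxies())
    # Call get() on the session itself so the configured proxies are used.
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Example call (needs a running Tor client):
# html = fetch_via_tor("https://facebookcorewwwi.onion/")
```

The only changes relative to the original attempt are `session.proxies` (spelled with the trailing "s") and `session.get` in place of `requests.get`.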

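After some research, I found that someone else had tried something similar, and the solution was to connect to the proxy before importing the requests / urllib.request library.

So I tried to connect using the socks and socket libraries: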

import socks
import socket

def create_connection(address, timeout=None, source_address=None):
    sock = socks.socksocket()
    sock.connect(address)
    return sock

socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)

# patch the socket module
socket.socket = socks.socksocket
socket.create_connection = create_connection



import urllib.request

with urllib.request.urlopen('https://facebookcorewwwi.onion/') as response:
    html = response.read()
    print(html)

When I do this, the connection to the proxy is refused:

urllib.error.URLError: <urlopen error Error connecting to SOCKS5 proxy 127.0.0.1:9050: [Errno 61] Connection refused>
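
`[Errno 61] Connection refused` usually means that nothing is accepting connections on 127.0.0.1:9050, for example because the Tor client is not running or is listening on a different port. The standalone tor daemon defaults to port 9050, while Tor Browser uses 9150. A minimal sketch for checking this with the standard library follows. The function name `socks_port_open` is hypothetical, and the check should be run in a fresh interpreter, without the socket patch shown above.

```python
import socket


def socks_port_open(host="127.0.0.1", port=9050, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" as well as timeouts.
        return False


if __name__ == "__main__":
    # 9050: default for the tor daemon; 9150: default for Tor Browser.
    for candidate in (9050, 9150):
        print(candidate, socks_port_open(port=candidate))
```

This only confirms that a TCP listener exists on the port. It does not verify that the listener is actually a Tor SOCKS proxy.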

I then tried using the requests library instead, as follows (simply replacing the code from the `import urllib.request` line onward):

import requests
r = requests.get('https://facebookcorewwwi.onion/')
print(r.text)

But here I get this error:

raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='facebookcorewwwi.onion', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x10d93ee80>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
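
The `[Errno 8] nodename nor servname provided, or not known` message is what the local resolver (`getaddrinfo`) reports for a name it cannot resolve. This suggests that the .onion hostname was looked up on the local machine instead of being passed to the Tor proxy. Ordinary DNS has no records for .onion names, so the lookup has to be done by Tor itself. A small sketch that reproduces just that lookup, with the hypothetical helper name `resolves_locally`:

```python
import socket


def resolves_locally(hostname, port=443):
    """Return True if the local resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Same error family as "[Errno 8] nodename nor servname provided".
        return False


if __name__ == "__main__":
    print(resolves_locally("facebookcorewwwi.onion"))
```

If this prints False while a request to the same host fails with Errno 8, the request is most likely not going through the proxy's name resolution. With requests, the `socks5h://` scheme is the one that delegates resolution to the proxy.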

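It seems that no matter what I do, I cannot establish a connection to the proxy. Does anyone have an alternative solution or a way to fix this?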

0 answers:

No answers