Question

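I'm taking an online Python course for beginners. The current unit teaches students to extract all the links from a web page's source. The code is below; Block_of_Code is not given: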

def get_page(url):
    <Block_of_Code>

def get_next_target(page):
    start_link=page.find('<a href=')
    if start_link==-1:
        return None,0
    start_quote=page.find('"',start_link)
    end_quote=page.find('"',start_quote+1)
    url=page[start_quote+1:end_quote]
    return url,end_quote

def print_all_links(page):
    while True:
        url,endpos=(get_next_target(page))
        if url:
            print(url)
            page=page[endpos:]
        else:
            break

print_all_links(get_page('https://youtube.com'))
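The link-extraction part can be checked without any network access, which helps separate it from the proxy problem. The sketch below repeats get_next_target from above and runs it on a small hard-coded page. The collect_links helper and the SAMPLE_PAGE string are made up for illustration; they are not part of the course code.

```python
def get_next_target(page):
    """Return the first link URL in page and the index of its closing quote."""
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def collect_links(page):
    """Hypothetical helper: like print_all_links, but returns a list."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if not url:
            break
        links.append(url)
        # Continue searching after the link that was just found.
        page = page[endpos:]
    return links


# Hard-coded stand-in for a downloaded page (an assumption for testing).
SAMPLE_PAGE = (
    '<html><body><a href="https://example.com/a">A</a> '
    '<p>text</p> <a href="/b">B</a></body></html>'
)

links = collect_links(SAMPLE_PAGE)
print(links)
```

If this prints both URLs, the parsing code is fine and the remaining issue is only how get_page fetches the source.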

If I were not in China, Block_of_Code would not be a problem for me. As far as I know, it could be:

import urllib.request
return urllib.request.urlopen(url).read().decode('utf-8')

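But in China some websites (including YouTube) are blocked, so the code above does not work for them.

The goal of Block_of_Code is to get the source of any website, whether it is blocked or not.

I searched on Google and found some code that uses a SOCKS proxy, but none of it worked. For example, I wrote and tried the following code based on this article (after running pip install PySocks).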

import socket
import socks
import urllib.request

socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 2012)
socket.socket = socks.socksocket
return urllib.request.urlopen(url).read().decode('utf-8')

The error message is:

ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

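The reason I searched for code that uses a SOCKS proxy is that I have always used a SOCKS proxy service to access blocked websites. After launching the application supplied by my service provider, I can reach these sites with a web browser such as Firefox. (My SOCKS proxy port is 2012.)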
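One thing that can be ruled out first is whether anything is actually listening on the local proxy port when the script runs. The sketch below uses only the standard library. The function name proxy_port_is_open is made up for illustration, and the host and port defaults are the values from my setup. A True result only shows that the port accepts TCP connections, not that it speaks SOCKS5.

```python
import socket


def proxy_port_is_open(host="127.0.0.1", port=2012, timeout=3.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(proxy_port_is_open())
```

If this prints False while the provider's application is running, the port number or the proxy type configured in the script is probably not what the application really exposes.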

That said, any kind of solution is welcome, SOCKS proxy or not, as long as it lets me get the source of any page.
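One alternative worth trying is the third-party requests library with its SOCKS support (pip install requests[socks]). The sketch below assumes the local proxy really is a SOCKS5 proxy on 127.0.0.1:2012. The socks_proxies helper is made up for illustration. The socks5h scheme asks requests to resolve hostnames through the proxy rather than locally. This is not guaranteed to remove the ConnectionResetError, for instance if the application on that port speaks a different protocol.

```python
def socks_proxies(host="127.0.0.1", port=2012, remote_dns=True):
    """Build a proxies mapping for requests (hypothetical helper).

    With remote_dns=True the 'socks5h' scheme is used, so hostnames are
    resolved by the proxy instead of by the local machine.
    """
    scheme = "socks5h" if remote_dns else "socks5"
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def get_page(url, proxies=None, timeout=15):
    """Fetch url through the SOCKS proxy and return the page source."""
    # Imported here so the helper above works even without requests installed.
    import requests  # third-party: pip install requests[socks]

    response = requests.get(url, proxies=proxies or socks_proxies(),
                            timeout=timeout)
    response.raise_for_status()  # raise on HTTP error status codes
    return response.text


# Usage, assuming the proxy application is running:
# print_all_links(get_page('https://youtube.com'))
```

Passing the proxy per request keeps socket.socket untouched, so other code in the same process is not affected by the proxy setting.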

I'm using Python 3.6.3 on Windows 10.

How to access websites using a SOCKS proxy in Python

0 answers: