Question

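I am writing a small program that, given a URL, fetches all the hyperlinks from a web page. The network I am on appears to use a proxy, though, and the program cannot fetch anything. My code: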

import sys
import urllib
import urlparse

from bs4 import BeautifulSoup
def process(url):
    page = urllib.urlopen(url) 
    text = page.read()
    page.close()
    soup = BeautifulSoup(text) 
    with open('s.txt','w') as file:
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            print tag['href']
            file.write('\n')
            file.write(tag['href'])


def main():
    if len(sys.argv) == 1:
        print 'No url !!'
        sys.exit(1)
    for url in sys.argv[1:]:
        process(url)

Answer 1

The urllib library you are using for HTTP access does not support proxy authentication (it does support unauthenticated proxies). From the docs:

Proxies which require authentication for use are not currently supported; this is considered an implementation limitation.

I suggest you switch to urllib2 and use it as described in the answer to this post.
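
As a rough illustration of the urllib2-style approach (not necessarily the method from the linked answer), the sketch below routes requests through a proxy by embedding the credentials in the proxy URL passed to ProxyHandler. It assumes Python 3, where urllib2's functionality lives in urllib.request; in Python 2 the same classes are available as urllib2.ProxyHandler and urllib2.build_opener. The host proxy.example.com, the port 8080, and the credentials are placeholders, and the helper names make_proxy_url and build_proxy_opener are hypothetical.

```python
import urllib.request
from urllib.parse import quote


def make_proxy_url(host, port, username=None, password=None):
    """Return an http:// proxy URL, with percent-encoded credentials if given."""
    if username is None:
        return 'http://%s:%d' % (host, port)
    # Percent-encode the credentials so characters such as '@' or ':'
    # do not break the URL; ProxyHandler decodes them again.
    return 'http://%s:%s@%s:%d' % (
        quote(username, safe=''), quote(password or '', safe=''), host, port)


def build_proxy_opener(proxy_url):
    """Return an opener that sends http and https requests via proxy_url."""
    proxy_handler = urllib.request.ProxyHandler(
        {'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(proxy_handler)


# Example usage (placeholder proxy, so it is left commented out):
# opener = build_proxy_opener(
#     make_proxy_url('proxy.example.com', 8080, 'user', 'secret'))
# text = opener.open('http://example.com/').read()
```

In the question's process function, opener.open(url) would then take the place of urllib.urlopen(url).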

Answer 2

You can use the requests module instead.

import requests

proxies = {'http': 'http://host/'}
# or, if the proxy requires authentication, use 'http://user:pass@host/' instead

r = requests.get(url, proxies=proxies)
text = r.text
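
Once text has been fetched through the proxy, the link extraction from the question can proceed as before. The sketch below shows that step on its own, using the standard library's html.parser as a stand-in for BeautifulSoup and assuming Python 3; the names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # Skip anchors without an href or with an empty one.
            if name == 'href' and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return all hyperlinks found in html_text as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links
```

Calling extract_links(text, url) returns a list that can be printed or written to s.txt just as in the original loop.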

How to access a web page with Python through a proxy
