I am trying to read the HTML from a URL. I tried the following:
import requests
f = requests.get('http://www.google.com')
print(f.text)
This returned the following traceback:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x03142310>: Failed to establish a new connection: [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
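So I assume my workplace (a university) uses a proxy. I used http://www.whatismyproxy.com/ to get the external IP, guessed that the port was 80, and came up with the following code (IP changed):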
import requests
link = 'http://www.google.com'  # URL under test
f = requests.get(link,
                 proxies={"http": "http://123.45.678.910:80"})
print(f.text)
This ran, but the HTML it returned was not Google's (the result is the same if I change the URL to Twitter):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head>
<title>Index of /</title>
</head>
<body>
<h1>Index of /</h1>
<table>
<tr><th valign="top"><img src="/icons/blank.gif" alt="[ICO]"></th><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
<tr><th colspan="5"><hr></th></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="direct.dat">direct.dat</a></td><td align="right">2013-10-24 18:09 </td><td align="right"> 73 </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="errors/">errors/</a></td><td align="right">2015-01-13 16:15 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="filtered.dat">filtered.dat</a></td><td align="right">2015-02-06 13:39 </td><td align="right">3.0K</td><td> </td></tr>
<tr><td valign="top"><img src="/icons/folder.gif" alt="[DIR]"></td><td><a href="html/">html/</a></td><td align="right">2016-09-30 07:50 </td><td align="right"> - </td><td> </td></tr>
<tr><td valign="top"><img src="/icons/unknown.gif" alt="[ ]"></td><td><a href="wpad.dat">wpad.dat</a></td><td align="right">2016-03-30 05:16 </td><td align="right">2.5K</td><td> </td></tr>
<tr><th colspan="5"><hr></th></tr>
</table>
<address>Apache/2.4.10 (Debian) Server at www.google.com Port 80</address>
</body></html>
Is this something I can fix myself, or is it down to my workplace's setup (and how would I confirm that)?
Answer 0 (score: 0)
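The proxy settings I needed could not be discovered from an external website. I got them from the wpad.dat file, which I found at wpad.myuniversityname.ac. A second useful note: you may need to extend the proxies dictionary to include both http and https settings: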
proxies={"http": "http://123.45.678.910:80", "https": "http://123.45.678.910:80"}
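For anyone who wants to script the wpad.dat lookup, here is a rough sketch of how it might look. It assumes the wpad.dat file is an ordinary PAC script that names its proxies in `PROXY host:port` directives and that it can be downloaded without going through the proxy. The helper names (`proxies_from_pac`, `build_proxies`, `fetch_via_wpad`) are made up for this example and are not part of requests. The regular expression is only a heuristic: a PAC file is JavaScript and may pick different proxies for different URLs, so check the file by hand if the first match does not work.

```python
import re


def proxies_from_pac(pac_text):
    """Return the distinct 'host:port' strings named in PROXY directives."""
    found = re.findall(r'PROXY\s+([A-Za-z0-9.\-]+:\d+)', pac_text)
    # Keep first-seen order while dropping duplicates.
    unique = []
    for host_port in found:
        if host_port not in unique:
            unique.append(host_port)
    return unique


def build_proxies(host_port):
    """Build a requests-style proxies dict covering both http and https."""
    url = 'http://' + host_port
    return {'http': url, 'https': url}


def fetch_via_wpad(wpad_url, target_url):
    """Download the PAC file, then fetch target_url through the first proxy it lists."""
    # Imported here so the parsing helpers above work without requests installed.
    import requests

    pac_text = requests.get(wpad_url, timeout=10).text
    candidates = proxies_from_pac(pac_text)
    if not candidates:
        raise ValueError('no PROXY directive found in ' + wpad_url)
    return requests.get(target_url,
                        proxies=build_proxies(candidates[0]),
                        timeout=10)
```

With the placeholder host from above, a call might look like `fetch_via_wpad('http://wpad.myuniversityname.ac/wpad.dat', 'http://www.google.com')`; the `/wpad.dat` path is the conventional location, but that is an assumption about your network.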