Why can't my Python script download a web page through a proxy?

Time: 2011-08-09 18:07:28

Tags: python sockets proxy

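I'm new to Python and I'm trying my luck with sockets. I wrote a simple HTTP client, but to my surprise it can't fetch a web page that Firefox can reach, even though both send the same headers.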

import socket
clientsocket= socket.socket(socket.AF_INET, socket.SOCK_STREAM)
clientsocket.connect(("213.229.83.205",80))#connect to proxy at given address
print "connected to 213.229.83.205"
sdata= """GET http://google.co.ug/ HTTP/1.1
Host: google.co.ug
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/6.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Proxy-Connection: keep-alive
Cookie: cookie <-- Real cookie deleted

"""
print "sending request"
clientsocket.send(sdata);
rdata=clientsocket.recv(10240)
if not rdata: print "no data found"
else:
    print "receiving data !"
    myfile=open("c:/users/markdenis/desktop/google.html","w")
    myfile.write(str(rdata))
    myfile.close()
    print "data written to file on desktop"
clientsocket.close()
raw_input()#system(pause)

When I run it, it prints:

connected to 213.229.83.205
sending request
no data found

1 answer:

Answer 0 (score: 5)

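HTTP requires \r\n at the end of every header line, plus an empty line to mark the end of the headers. You did not write the line endings explicitly in the sdata buffer, so the buffer ends up with bare \n line endings.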

I tested this on Windows, Linux, and OS X to be sure:

>>> x = """a
... b
... c
... """
>>> x
'a\nb\nc\n'

Whereas what you need is:

>>> x = "a\r\nb\r\nc\r\n"
>>> x
'a\r\nb\r\nc\r\n'
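
If you would rather keep the request in a triple-quoted string, one option is to normalize the line endings just before sending. This is a minimal sketch written for Python 3; the helper name `to_crlf` is made up for illustration and is not part of any library:

```python
def to_crlf(text):
    """Return text with every line terminated by CRLF."""
    # splitlines() splits on \n, \r\n and \r (among other line boundaries),
    # so input that already uses \r\n does not end up with doubled endings.
    return "".join(line + "\r\n" for line in text.splitlines())


request = """GET http://google.co.ug/ HTTP/1.1
Host: google.co.ug

"""
# The blank line before the closing quotes becomes the empty line
# that terminates the header block.
print(repr(to_crlf(request)))
```

The converted string still has to be encoded to bytes before it is passed to `send()` on Python 3.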

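Add the \r\n and try again. Typing \r\n directly inside the triple-quoted buffer would leave an extra \n after each one, so build the string piece by piece instead: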

sdata = "GET http://google.co.ug/ HTTP/1.1\r\n"
sdata += "Host: google.co.ug\r\n"
sdata += "User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/6.0\r\n"
sdata += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
sdata += "Accept-Language: en-us,en;q=0.5\r\n"
sdata += "Accept-Encoding: gzip, deflate\r\n"
sdata += "Proxy-Connection: keep-alive\r\n"
sdata += "\r\n"
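
Putting the pieces together, the whole client might look like the sketch below. It is written for Python 3 (the question uses Python 2), so the request is encoded to bytes before sending. The helper names `build_request` and `fetch_via_proxy` are hypothetical, and the `Connection: close` header is an added assumption so that the proxy closes the socket once the response is complete, which lets the read loop terminate.

```python
import socket


def build_request(url, host, extra_headers=()):
    """Build a proxy-style GET request with CRLF line endings."""
    lines = ["GET %s HTTP/1.1" % url, "Host: %s" % host]
    lines.extend("%s: %s" % (name, value) for name, value in extra_headers)
    # Assumption: ask for the connection to be closed after the response.
    lines.append("Connection: close")
    # Every line ends in \r\n, and an empty line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_via_proxy(proxy_addr, request, timeout=10.0):
    """Send the request bytes to the proxy and return the raw response bytes."""
    with socket.create_connection(proxy_addr, timeout=timeout) as sock:
        sock.sendall(request)  # keeps sending until the whole buffer is written
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty bytes means the peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


request = build_request("http://google.co.ug/", "google.co.ug",
                        [("Accept-Language", "en-us,en;q=0.5")])
print(repr(request))
```

Calling `fetch_via_proxy(("213.229.83.205", 80), request)` would then send the request to the proxy address from the question. A single `recv()` call, as in the original script, may return only part of the response, which is why the sketch loops until the peer closes the connection.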