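I want to send an HTTP GET request and receive the source code of a web page, and this has to be done with sockets. I set the buffer size to 4096, but my script only downloads a small part of the page: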
import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "edition.cnn.com", 80 ) )
host = socket.gethostbyname("edition.cnn.com")
sock.sendall('GET http://edition.cnn.com/index.html HTTP/1.1\r\n'\
+ 'User-Agent: agent123\r\n'\
+ 'Host: '+host+'\r\n'\
+ '\r\n')
print sock.recv(4096)
sock.close()
After running this code, I get:
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 01 Jan 2014 18:31:25 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: CG=GR:44:Réthimnon; path=/
Last-Modified: Wed, 01 Jan 2014 18:31:22 GMT
Vary: Accept-Encoding
Cache-Control: max-age=60, private
Expires: Wed, 01 Jan 2014 18:32:25 GMT
ac2a
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<title>CNN.com International - Breaking, World, Business, Sports, Entertainment and Video News</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta http-equiv="last-modified" content="2014-01-01T18:28:34Z"/>
<meta http-equiv="refresh" content="1800;url=http://edition.cnn.com/?refresh=1"/>
<meta name="robots" content="index,follow"/>
<meta name="googlebot" content="noarchive"/>
<meta name="description" content="CNN.com International delivers breaking news from across the globe and information on the latest top stories, business, sports and entertainment headlines. Follow the news as it happens through: special reports, videos, audio, photo galleries plus interactive maps and timelines."/>
<meta name="keywords" content="CNN, CNN news, CNN International, CNN International news, CNN Edition, Edition news, news, news online, breaking news, U.S. news, world news, global news, weather, business, CNN Money, sports, politics, law, technology, entertainment, education,
That is not even the first 13 lines of the source (compare with view-source:http://edition.cnn.com/index.html).
Another problem shows up when I use google.com as the host:
import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "google.com", 80 ) )
host = socket.gethostbyname("google.com")
sock.sendall('GET http://google.com/index.html HTTP/1.1\r\n'\
+ 'User-Agent: agent123\r\n'\
+ 'Host: '+host+'\r\n'\
+ '\r\n')
print sock.recv(4096)
sock.close()
I get this response:
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/index.html
Content-Type: text/html; charset=UTF-8
Date: Wed, 01 Jan 2014 18:38:57 GMT
Expires: Fri, 31 Jan 2014 18:38:57 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 229
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/index.html">here</A>.
</BODY></HTML>
It says the page has moved to the same address I was trying to download...
Answer 0 (score: 3)
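sock.recv(4096) reads at most 4096 bytes; how much a call actually returns depends on how much data has already arrived. There is no guarantee that 4096 bytes can be read in one go.
You have to keep reading from the socket until all data has been received: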
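data = b''
chunk = sock.recv(4096)
# recv() returns an empty result once the server closes the connection;
# on a keep-alive connection that may only happen after a timeout.
while chunk:
    data += chunk
    chunk = sock.recv(4096)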
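On a keep-alive connection the server may not close the socket after the response, so waiting for an empty chunk can block for a long time. As a minimal sketch (written for Python 3, where recv() returns bytes), the helper below reads up to the blank line that ends the headers and then only as many body bytes as `Content-Length` announces. The name `read_response` is made up for illustration, and the sketch assumes the response is not chunked.

```python
import socket

def read_response(sock, bufsize=4096):
    """Read one HTTP response that carries a Content-Length header.

    Hypothetical helper: assumes no chunked transfer encoding.
    """
    data = b''
    # Keep reading until the blank line that ends the header block.
    while b'\r\n\r\n' not in data:
        chunk = sock.recv(bufsize)
        if not chunk:              # peer closed before the headers ended
            return data
        data += chunk
    head, _, body = data.partition(b'\r\n\r\n')

    # Look for Content-Length among the header lines (skip the status line).
    length = None
    for line in head.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        if name.strip().lower() == b'content-length':
            length = int(value.strip())

    # Read until the announced length is reached, or until the peer
    # closes the connection when no length was announced.
    while length is None or len(body) < length:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        body += chunk
    return head + b'\r\n\r\n' + body
```

For the Google reply in the question, which announces `Content-Length: 229`, this would return as soon as the 229 body bytes have arrived instead of waiting for the server to hang up.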
Your request for http://google.com/index.html is redirected to www.google.com, which is a different hostname. Adjust your request accordingly.
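One detail worth checking while adjusting the request: the code in the question puts the IP address returned by gethostbyname() into the `Host:` header, whereas that header is meant to carry the hostname the client is asking for. The sketch below (Python 3) builds a request that way; `build_get_request` and its parameters are illustrative names, and adding `Connection: close` is an assumption made here so that a simple read-until-closed loop terminates.

```python
def build_get_request(hostname, path='/', user_agent='agent123'):
    """Build a minimal HTTP/1.1 GET request as bytes (hypothetical helper)."""
    lines = [
        'GET {} HTTP/1.1'.format(path),
        'Host: {}'.format(hostname),          # the name, not the resolved IP
        'User-Agent: {}'.format(user_agent),
        'Connection: close',                  # ask the server to close when done
        '',                                   # blank line ends the header block
        '',
    ]
    return '\r\n'.join(lines).encode('ascii')
```

It could then be sent with something like `sock.sendall(build_get_request('www.google.com', '/index.html'))` after connecting to `('www.google.com', 80)`.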
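If you want to implement a full HTTP client, you will have to parse the status line and handle the 301 redirect response by parsing the Location: header and opening a new connection to request the new URL it gives you.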
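The parsing step described above can be sketched roughly as follows (Python 3). The function names `parse_head` and `redirect_target` are invented for this example, and the set of status codes treated as redirects is a choice made here, not something taken from the question. A real client would also need to cap the number of redirects it follows.

```python
from urllib.parse import urljoin

def parse_head(raw):
    """Split raw response bytes into (status_code, headers dict)."""
    head = raw.split(b'\r\n\r\n', 1)[0].decode('iso-8859-1')
    status_line, *header_lines = head.split('\r\n')
    # The status line looks like: HTTP/1.1 301 Moved Permanently
    status = int(status_line.split(' ', 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status, headers

def redirect_target(request_url, raw):
    """Return the URL to retry, or None if the response is not a redirect."""
    status, headers = parse_head(raw)
    if status in (301, 302, 303, 307, 308) and 'location' in headers:
        # Location may be relative, so resolve it against the request URL.
        return urljoin(request_url, headers['location'])
    return None
```

Given the 301 reply shown in the question, `redirect_target('http://google.com/index.html', raw)` yields the www.google.com URL, which is the address to open a new connection for.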
Answer 1 (score: 0)
edition.cnn.com uses HTTP/1.0 and www.google.com uses HTTP/1.1. Perhaps someone can explain how to tell which one to use.
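A possible explanation, offered as an assumption rather than something verified against these servers: a reply to an HTTP/1.0 request normally ends when the server closes the connection, while an HTTP/1.1 reply may stay on an open connection and use `Transfer-Encoding: chunked`, as the CNN headers in the question do. In that format each chunk is preceded by its size in hexadecimal, so the `ac2a` line in the question announces a first chunk of 44074 bytes, far more than a single recv(4096) can return. Below is a minimal Python 3 sketch of decoding such a body once it has been read completely; `decode_chunked` is an illustrative name and trailers are ignored.

```python
def decode_chunked(body):
    """Decode a complete chunked body (bytes); trailers are ignored."""
    decoded = b''
    pos = 0
    while True:
        line_end = body.index(b'\r\n', pos)
        # The size line is hexadecimal and may carry ';ext' extensions.
        size = int(body[pos:line_end].split(b';', 1)[0], 16)
        if size == 0:              # a zero-size chunk marks the end
            break
        start = line_end + 2
        decoded += body[start:start + size]
        pos = start + size + 2     # skip the CRLF after the chunk data
    return decoded
```

The input here is only the part of the response after the blank line that ends the headers.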
Works for: www.google.com
import socket
import time

domain = 'www.google.com'
# must specify index.html for google
full_url = 'http://www.google.com/index.html'

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
# HTTP/1.1 requires a Host header, and header lines end with CRLF
mysock.sendall('GET ' + full_url + ' HTTP/1.1\r\nHost: ' + domain + '\r\n\r\n')
while True:
    data = mysock.recv(512)
    time.sleep(2.0)  # 2 second delay
    if len(data) < 1:  # empty result: the server closed the connection
        break
    print data
mysock.close()
Works for: edition.cnn.com
Warning: the output is large; consider raising recv(512) to a larger number or changing time.sleep(2.0) to 1 second.
import socket
import time

domain = 'cnn.com'
full_url = 'http://edition.cnn.com/'

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
# header lines end with CRLF; a blank line ends the request
mysock.sendall('GET ' + full_url + ' HTTP/1.0\r\n\r\n')
while True:
    data = mysock.recv(512)
    time.sleep(2.0)  # 2 second delay
    if len(data) < 1:  # empty result: the server closed the connection
        break
    print data
mysock.close()
Both processes finish with exit code 0.