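Python - Fetching a page's source code using sockets

Time: 2014-01-01 18:45:08

Tags: python sockets

I want to send an HTTP GET request and receive the source code of a web page, and this has to be done with sockets. I set the buffer size to 4096, but my script only downloads a small part of the page.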

import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "edition.cnn.com", 80 ) )

host = socket.gethostbyname("edition.cnn.com")
sock.sendall('GET http://edition.cnn.com/index.html HTTP/1.1\r\n'\
    + 'User-Agent: agent123\r\n'\
    + 'Host: '+host+'\r\n'\
    + '\r\n')

print sock.recv(4096)
sock.close()

After running this code, I get:

HTTP/1.1 200 OK
Server: nginx
Date: Wed, 01 Jan 2014 18:31:25 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: CG=GR:44:Réthimnon; path=/
Last-Modified: Wed, 01 Jan 2014 18:31:22 GMT
Vary: Accept-Encoding
Cache-Control: max-age=60, private
Expires: Wed, 01 Jan 2014 18:32:25 GMT

ac2a


<!DOCTYPE HTML>
<html lang="en-US">
<head>
<title>CNN.com International - Breaking, World, Business, Sports, Entertainment and Video News</title>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<meta http-equiv="content-type" content="text/html;charset=utf-8"/>
<meta http-equiv="last-modified" content="2014-01-01T18:28:34Z"/>
<meta http-equiv="refresh" content="1800;url=http://edition.cnn.com/?refresh=1"/>
<meta name="robots" content="index,follow"/>
<meta name="googlebot" content="noarchive"/>
<meta name="description" content="CNN.com International delivers breaking news from across the globe and information on the latest top stories, business, sports and entertainment headlines. Follow the news as it happens through: special reports, videos, audio, photo galleries plus interactive maps and timelines."/>
<meta name="keywords" content="CNN, CNN news, CNN International, CNN International news, CNN Edition, Edition news, news, news online, breaking news, U.S. news, world news, global news, weather, business, CNN Money, sports, politics, law, technology, entertainment, education,

That is not even the first 13 lines of the source... view-source:http://edition.cnn.com/index.html


Another problem appears when I try an address such as google.com as the host:

import socket
sock = socket.socket ( socket.AF_INET, socket.SOCK_STREAM )
sock.connect ( ( "google.com", 80 ) )

host = socket.gethostbyname("google.com")
sock.sendall('GET http://google.com/index.html HTTP/1.1\r\n'\
    + 'User-Agent: agent123\r\n'\
    + 'Host: '+host+'\r\n'\
    + '\r\n')
print sock.recv(4096)
sock.close()

I get this response:

HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/index.html
Content-Type: text/html; charset=UTF-8
Date: Wed, 01 Jan 2014 18:38:57 GMT
Expires: Fri, 31 Jan 2014 18:38:57 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 229
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic

<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/index.html">here</A>.

</BODY></HTML>

which says the page has moved to the same address I am trying to download...

2 answers:

Answer 0 (score: 3)

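sock.recv(4096) reads at most 4096 bytes; how much the call actually returns depends on how much data has already arrived. There is no guarantee that 4096 bytes can be read in one go.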
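The behaviour described above can be reproduced without touching the network. The following is only a minimal sketch in Python 3 (the code in the question is Python 2), and it assumes a local socket.socketpair() standing in for the client and the web server:

```python
import socket

# Two connected local sockets: "server" plays the web server, "client" plays us.
server, client = socket.socketpair()

server.sendall(b"first part")      # only 10 bytes are available so far
first = client.recv(4096)          # returns what has arrived, not 4096 bytes

server.sendall(b" second part")    # the rest arrives later
server.close()                     # the peer closes the connection when done
second = client.recv(4096)
end = client.recv(4096)            # empty bytes: the peer closed the connection

client.close()
print(len(first), len(second), repr(end))
```

The single recv(4096) call in the question corresponds to `first` here: it returns only what happened to be buffered at that moment.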

You have to keep reading from the socket until all the data has been received:

data = ''
chunk = sock.recv(4096)
while chunk:
    # recv() returns an empty string once the server closes the connection
    data += chunk
    chunk = sock.recv(4096)
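Once the whole response has been collected, note that the CNN response in the question carries Transfer-Encoding: chunked, so the body is split into chunks, each preceded by its size in hexadecimal (the ac2a line in the output). A rough Python 3 sketch of decoding such a body follows; decode_chunked is a hypothetical helper name, and trailers and malformed input are not handled:

```python
def decode_chunked(body: bytes) -> bytes:
    """Join the chunks of a chunked HTTP body (no trailer support)."""
    decoded = bytearray()
    pos = 0
    while True:
        # Each chunk starts with "<hex size>[;extensions]\r\n".
        line_end = body.index(b"\r\n", pos)
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        if size == 0:              # a zero-sized chunk ends the body
            break
        start = line_end + 2
        decoded += body[start:start + size]
        pos = start + size + 2     # skip the CRLF that follows the chunk data
    return bytes(decoded)


example = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(decode_chunked(example))
```

The input is expected to be the part of the response after the blank line that ends the headers.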

Your request for http://google.com/index.html is redirected to www.google.com, which is a different hostname. Adjust your request accordingly.
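One way the adjusted request might look is sketched below in Python 3. build_get_request is a hypothetical helper, not part of the socket module. It assumes two changes relative to the question's code: the Host header carries the hostname rather than the IP address returned by gethostbyname, and a Connection: close header asks the server to close the connection after the response, so that a read loop like the one above can terminate.

```python
def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",            # the hostname, not the resolved IP address
        "User-Agent: agent123",
        "Connection: close",        # ask the server to close when it is done
        "",                         # the two empty entries produce the
        "",                         # blank line that ends the headers
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_get_request("www.google.com", "/index.html")
print(request.decode("ascii"))
```

The resulting bytes would be passed to sock.sendall() after connecting to ("www.google.com", 80).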

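If you want to implement a full HTTP client, you will have to parse the status line and handle 301 redirect responses by parsing the Location: header and opening a new connection to request the new URL you were given.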
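As an illustration of that parsing step only, here is a minimal Python 3 sketch. parse_response_head is a hypothetical helper, the sample bytes are an abbreviated copy of the 301 response from the question, and real clients must cope with many cases this ignores (repeated headers, missing fields, relative Location values).

```python
from urllib.parse import urlsplit


def parse_response_head(raw: bytes):
    """Split a raw response into (status code, headers dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status = int(lines[0].split(" ", 2)[1])     # "HTTP/1.1 301 Moved ..."
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers, body


# Abbreviated version of the response shown in the question.
sample = (
    b"HTTP/1.1 301 Moved Permanently\r\n"
    b"Location: http://www.google.com/index.html\r\n"
    b"Content-Length: 229\r\n"
    b"\r\n"
    b"<HTML>"
)

status, headers, body = parse_response_head(sample)
target = None
if status in (301, 302, 303, 307, 308):
    # The new connection goes to target.hostname and requests target.path.
    target = urlsplit(headers["location"])
    print(target.hostname, target.path)
```

A client would then connect to the new hostname and repeat the request, ideally with a limit on the number of redirects it follows.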

Answer 1 (score: 0)

edition.cnn.com uses HTTP/1.0 and www.google.com uses HTTP/1.1. Maybe someone can explain how to tell which one to use.

Works for: www.google.com

import socket
import time

domain = 'www.google.com'
# must specify index.html for google
full_url = 'http://www.google.com/index.html'


mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
mysock.send('GET ' + full_url + ' HTTP/1.1\n\n')

while True:
    data = mysock.recv(512)
    time.sleep(2.0)     # 2 second delay
    if len(data) < 1:
        break
    print data

mysock.close()

Works for: edition.cnn.com

Warning: the output is large; consider raising recv(512) to a larger number or changing time.sleep(2.0) to 1 second.

import socket
import time

domain = 'cnn.com'
full_url = 'http://edition.cnn.com/'

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect((domain, 80))
mysock.send('GET ' + full_url + ' HTTP/1.0\n\n')

while True:
    data = mysock.recv(512)
    time.sleep(2.0)     # 2 second delay
    if len(data) < 1:
        break
    print data

mysock.close()

Both processes finished with exit code 0.