curl -L http://tumblr.com/tagged/long-reads
The result is here: http://pastebin.com/XtQVubBp
That response is different from what the following returns:

import urllib2

def download(url):
    f = urllib2.urlopen(url)
    return f.read()

html = download('http://tumblr.com/tagged/long-reads')
print html
Here is the second result: http://pastebin.com/MdzrhBZv
Why? I want download() to return the same thing curl does. How do I do that?
Here are the request headers from curl:
$ curl -v -L http://tumblr.com/tagged/long-reads
* About to connect() to tumblr.com port 80 (#0)
* Trying 50.97.149.179... connected
* Connected to tumblr.com (50.97.149.179) port 80 (#0)
> GET /tagged/long-reads HTTP/1.1
> User-Agent: curl/7.21.6 (i686-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
> Host: tumblr.com
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Cache-Control: no-cache
< Content-length: 0
< Location: http://www.tumblr.com/tagged/long-reads
< Connection: close
<
* Closing connection #0
* Issue another request to this URL: 'http://www.tumblr.com/tagged/long-reads'
* About to connect() to www.tumblr.com port 80 (#0)
* Trying 50.97.143.18... connected
* Connected to www.tumblr.com (50.97.143.18) port 80 (#0)
> GET /tagged/long-reads HTTP/1.1
> User-Agent: curl/7.21.6 (i686-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
> Host: www.tumblr.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Mon, 07 May 2012 22:09:01 GMT
< Server: Apache
< P3P: CP="ALL ADM DEV PSAi COM OUR OTRo STP IND ONL"
< Set-Cookie: tmgioct=iVajmrL8Wj8YffLTthjFyqYn; expires=Thu, 05-May-2022 22:09:01 GMT; path=/; httponly
< Vary: Accept-Encoding
< X-Tumblr-Usec: D=266934
< Connection: close
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
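For comparison, here is a minimal debugging sketch (untested; it assumes Python 2's urllib2 with no extra handlers installed) that prints the status code, final URL, and response headers urllib2 ends up with for the same request:

import urllib2

# Show what urllib2 actually receives after following redirects,
# so it can be compared with the curl -v output above.
resp = urllib2.urlopen('http://tumblr.com/tagged/long-reads')
print resp.getcode()   # final HTTP status code
print resp.geturl()    # final URL after any redirects
print resp.info()      # response headers
resp.close()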
Edit: I will offer a 500-point bounty to whoever solves this for me.
Answer 0 (score: 0)
It's hard to say exactly how to make them come out the same; you would have to know which headers curl is sending and reproduce them in urllib2. Once you know which headers curl uses, though, it should be as simple as setting them on the Request object:
>>> moz_req = urllib2.Request('http://www.google.com', headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'})
>>> pyt_req = urllib2.Request('http://www.google.com', headers={'User-Agent': 'Python-urllib/2.6'})
>>> moz_url = urllib2.urlopen(moz_req)
>>> moz_str = moz_url.read()
>>> moz_url.close()
>>> pyt_url = urllib2.urlopen(pyt_req)
>>> pyt_str = pyt_url.read()
>>> pyt_url.close()
>>> moz_str == pyt_str
False
When I run the following, I get a page full of blog posts.
import urllib2

def download(url):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
    req = urllib2.Request(url, headers=headers)
    url = urllib2.urlopen(req)
    page = url.read()
    url.close()
    return page

html = download('http://tumblr.com/tagged/long-reads')

page = open('page.html', 'w')
page.write(html)
page.close()
However, I checked, and I get the same result even without setting the headers. Something else must be going wrong...
Answer 1 (score: 0)
If you use https instead of http, you at least get a result. Incidentally, this is not an issue with up-to-date libraries.
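A minimal sketch of that suggestion (untested; it assumes this Python 2 build has SSL support, which urllib2 needs to open https URLs):

import urllib2

# Same request as in the question, with only the scheme changed to https.
resp = urllib2.urlopen('https://tumblr.com/tagged/long-reads')
html = resp.read()
resp.close()
print html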