How do I request an already-quoted URL?

Time: 2012-02-13 21:04:21

Tags: python python-requests

Earlier code gave me this URL: http://en.wikipedia.org/wiki/M%C3%BCnster. Now I want to request it, but I can't figure out how:

>>> requests.get('http://en.wikipedia.org/wiki/M%C3%BCnster')
<Response [400]>
>>> requests.get(urlparse.unquote('http://en.wikipedia.org/wiki/M%C3%BCnster'))
<Response [400]>
>>> requests.get(urlparse.unquote('http://en.wikipedia.org/wiki/M%C3%BCnster').decode('utf-8'))
<Response [400]>

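The problem is that requests tries to be too clever about quoting. It quotes the URL again, so each of the three calls above actually asks for: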

Request URI: /wiki/M%25C3%25BCnster
Request URI: /wiki/M%25C3%25BCnster
Request URI: /wiki/M%25C3%25BCnster

Any ideas?

3 answers:

Answer 0 (score: 2)

A plain urlparse.unquote combined with a custom User-Agent header seems to do the job.

>>> s = 'http://en.wikipedia.org/wiki/M%C3%BCnster'
>>> import urllib2, urlparse
>>> headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'}
>>> url = urlparse.unquote(s)
>>> req = urllib2.Request(url, None, headers)
>>> resp = urllib2.urlopen(req)
>>> print resp.code
200
>>> data = resp.read()
>>> print 'The last outstanding palace of the German baroque period is created according to plans by Johann Conrad Schlaun.' in data
True

Do not decode the byte string into a unicode object. Doing so makes urlopen raise UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 11: ordinal not in range(128)
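The error above can be reproduced without any network access. The following is a minimal sketch, assuming Python 3 (where `unquote` lives in `urllib.parse` and returns text) rather than the Python 2 used in this answer. The `request_line` string is a hypothetical stand-in for what an HTTP client would send. It is not taken from urllib2 itself, but the position 11 in the message is consistent with a request line of this shape.

```python
from urllib.parse import unquote

# Assumption: Python 3. unquote() decodes %C3%BC as UTF-8 and returns text
# containing the non-ASCII character 'ü'.
path = unquote("/wiki/M%C3%BCnster")

# Hypothetical request line, built only to illustrate the encoding step.
request_line = "GET " + path + " HTTP/1.1"

try:
    request_line.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

if failure is not None:
    # 'G', 'E', 'T', ' ' plus '/wiki/M' puts 'ü' at index 11.
    print(failure.start, repr(failure.object[failure.start]))
```

Keeping the URL as percent-encoded ASCII, or as a UTF-8 byte string in Python 2, avoids this implicit ASCII encoding step.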

Answer 1 (score: 2)

This is a bug in requests. It has been fixed in the develop branch. See: https://github.com/kennethreitz/requests/pull/387
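To illustrate the kind of bug described here, the sketch below shows how quoting an already-quoted path produces the `%25` sequences seen in the question, and one way to make quoting idempotent by unquoting first. It assumes Python 3 and uses only the standard library. `requote_path` is a hypothetical helper written for this example, not the code from the linked pull request.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

already_quoted = "http://en.wikipedia.org/wiki/M%C3%BCnster"

# Quoting a path that is already quoted escapes the '%' signs themselves.
double_quoted = quote("/wiki/M%C3%BCnster", safe="/")
print(double_quoted)  # /wiki/M%25C3%25BCnster


def requote_path(url):
    """Hypothetical helper: unquote the path, then quote it exactly once."""
    parts = urlsplit(url)
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(parts._replace(path=path))


# The same result comes back whether or not the input was already quoted.
print(requote_path(already_quoted))
print(requote_path("http://en.wikipedia.org/wiki/Münster"))
```

One caveat of this approach: a path that deliberately contains an escaped `%25` or `%2F` is changed by the round trip, so it only suits URLs where those escapes do not carry meaning.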

Answer 2 (score: 1)

Try adding .decode('utf-8'):

requests.get(urlparse.unquote('http://en.wikipedia.org/wiki/M%C3%BCnster').decode('utf-8'))