Unicode strings in urllib.request

Time: 2016-04-04 06:17:32

Tags: python python-3.x unicode encoding

Short version: I have a variable s = 'bär'. I need to convert s to an ASCII-only form, so that s = 'b%C3%A4r'.

Long version:

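I am using urllib.request.urlopen() to read mp3 pronunciation files from URLs. This works well, except that the URLs often contain non-ASCII characters, for example the German word "Bär". The full URL is https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär. Typing this URL into Chrome takes me to the mp3 file without any problem. Passing the same URL to urllib, however, fails.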

I am sure this is a Unicode problem, because the stack trace reads:

Traceback (most recent call last):
  File "importer.py", line 145, in <module>
    download_file(tuple[1], tuple[0], ".mp3")
  File "importer.py", line 81, in download_file
    with urllib.request.urlopen(url) as in_stream, open(to_fname+ext, 'wb') as out_file: #`with object as name:` safely __enter__() and __exit__() the runtime of object. `as` assigns `name` as referring to the object `object`.
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 162, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 465, in open
    response = self._open(req, data)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 483, in _open
    '_open', req)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 443, in _call_chain
    result = func(*args)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1283, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1240, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1083, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1118, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Users\quesm\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 960, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xfc' in position 19: ordinal not in range(128)

...besides the obvious UnicodeEncodeError, I can see that it is trying to encode() the request to ASCII.
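The failing step can be reproduced without any network access. The following is a minimal sketch; the request line is a hypothetical stand-in for whatever http.client actually builds, and only the final encode('ascii') call is taken from the traceback.

```python
# Hypothetical request line, similar in shape to what http.client sends.
request_line = "GET /tts/de/token/bär HTTP/1.1"

try:
    # The same call that appears at the bottom of the traceback.
    request_line.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    # 'ä' has no ASCII representation, so the encode step raises.
    error = exc

print(error)
```

The exception's start attribute gives the index of the first character that could not be encoded, which helps locate the offending part of the URL.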

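Interestingly, when I copy the URL from Chrome (rather than simply typing it into the Python interpreter), bär is translated to b%C3%A4r. When I pass that to urllib.request.urlopen(), it works fine, because every character is ASCII. So my goal is to perform this conversion in my program. I tried converting the original string to an equivalent form with unicodedata.normalize(), but none of its variants does this; besides, I am not sure how to store Unicode as ASCII, since Python 3 stores all strings as Unicode and therefore makes no attempt to convert the text.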
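What Chrome does here is percent-encoding: the text is encoded as UTF-8, and each byte outside the safe ASCII set is written as %XX. The sketch below spells that out by hand for illustration only; it is deliberately simplified (it treats only ASCII letters and digits as safe), and the variable names are made up for this example.

```python
from urllib.parse import quote

word = "bär"

# Step 1: encode the text as UTF-8; 'ä' becomes the two bytes C3 A4.
utf8_bytes = word.encode("utf-8")

# Step 2: keep ASCII letters and digits, write every other byte as %XX.
# Simplified on purpose: the real rules also keep a few punctuation characters.
manual = "".join(
    chr(b) if b < 128 and chr(b).isalnum() else "%{:02X}".format(b)
    for b in utf8_bytes
)

# The standard library produces the same result for this input.
library = quote(word)

print(manual, library)
```

This also shows why unicodedata.normalize() cannot help: normalization changes how characters are composed, but the result is still non-ASCII text.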

1 answer:

Answer 0: (score: 1)

Use urllib.parse.quote:

>>> import urllib.parse
>>> urllib.parse.quote('bär')
'b%C3%A4r'
>>> urllib.parse.urljoin('https://d7mj4aqfscim2.cloudfront.net/tts/de/token/',
...                      urllib.parse.quote('bär'))
'https://d7mj4aqfscim2.cloudfront.net/tts/de/token/b%C3%A4r'
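If the program only has the full URL rather than the bare word, quoting the whole string would also escape the ':' after the scheme, since quote() treats only '/' as safe by default. One possible approach, sketched here with a hypothetical helper named quote_url_path, is to split the URL and quote just the path. It assumes the path has not been percent-encoded already; an existing '%' would be escaped a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode only the path component of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # quote() encodes the path as UTF-8 and leaves '/' untouched by default.
    quoted_path = quote(parts.path)
    # Reassemble the URL with the scheme, host, query and fragment unchanged.
    return urlunsplit(
        (parts.scheme, parts.netloc, quoted_path, parts.query, parts.fragment)
    )


url = quote_url_path("https://d7mj4aqfscim2.cloudfront.net/tts/de/token/bär")
print(url)
```

The resulting string contains only ASCII characters, so it can be passed to urllib.request.urlopen() in place of the original URL.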