Question

Using Python 2.5.2 on Debian Linux, I am trying to fetch the content of a Spanish URL that contains the Spanish character 'í':

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url).read()

I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

Before passing the URL to urllib, I tried this:

url = urllib.quote(url)

and this:

url = url.encode('UTF-8')

but neither of them worked.

Can you tell me what I am doing wrong?

Answer 1

This works for me:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The line above declares the source file encoding; it must be on line 1 or 2.
# See: http://www.python.org/dev/peps/pep-0263/

import urllib
url = u'http://example.com/índice.html'
content = urllib.urlopen(url.encode("UTF-8")).read()

Answer 2

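Encoding the URL as UTF-8 should work. I wonder whether your source file is properly encoded and whether the interpreter knows about it. For example, if your Python source file is saved as UTF-8, you should have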

# coding=UTF-8

as the first or second line.

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()

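works for me.

Edit: Also, note that Unicode text in interactive Python sessions (whether in IDLE or in a console) is fraught with encoding-related difficulties. In those cases, you should use Unicode escapes (such as \u00ED in your case).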
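As a minimal sketch of the escape form mentioned in the edit above, the snippet below writes the accented character as \u00ed so the literal does not depend on the file or terminal encoding. It uses Python 3 syntax, which is an assumption; under Python 2 the literal would need the u prefix.

```python
# Python 3 sketch (assumption: the question itself targets Python 2.5).
# Writing the character as an escape keeps the source file pure ASCII.
url = "http://mydomain.es/\u00edndice.html"  # \u00ed is LATIN SMALL LETTER I WITH ACUTE

# Encoding to UTF-8 turns that one character into the two bytes C3 AD.
encoded = url.encode("utf-8")
print(repr(encoded))
```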

Answer 3

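According to the applicable standard, RFC 1738, URLs may contain only ASCII characters. There is a good explanation here, and I quote:

"...only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

As the page I linked explains, this probably means you have to replace the "lowercase i with an acute accent" with "%ED".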
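Which percent-encoded form applies depends on the byte encoding the server expects. The sketch below, written for Python 3 (an assumption, since the question uses Python 2.5), shows that "%ED" is the Latin-1 form of the character, while UTF-8 yields a two-byte sequence.

```python
from urllib.parse import quote, unquote

char = "\u00ed"  # lowercase i with acute accent

# quote() encodes to UTF-8 by default before percent-escaping.
utf8_form = quote(char)

# Passing encoding="latin-1" escapes the single Latin-1 byte instead.
latin1_form = quote(char, encoding="latin-1")

print(utf8_form, latin1_form)

# unquote() decodes UTF-8 by default, so the UTF-8 form round-trips.
assert unquote(utf8_form) == char
```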

Answer 4

It works for me. Make sure you are using a reasonably recent version of Python and that your file encoding is correct. Here is my code:

# -*- coding: utf-8 -*-
import urllib
url = u'http://mydomain.es/índice.html'
url = url.encode('utf-8')
content = urllib.urlopen(url).read()

(mydomain.es does not exist, so the DNS lookup fails, but there are no Unicode problems up to that point.)

Answer 5

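I have a similar case right now. I am trying to download images. I retrieve the URLs from the server in a JSON file. Some of the image URLs contain non-ASCII characters. This raises an error: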

for image in product["images"]: 
    filename = os.path.basename(image) 
    filepath = product_path + "/" + filename 
    urllib.request.urlretrieve(image, filepath) # error!

UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' in position ...

I tried using .encode("UTF-8"), but I can't say it helped:

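# coding=UTF-8
import urllib.request  # a bare "import urllib" does not reliably load the request submodule
url = u"http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = url.encode("UTF-8")
urllib.request.urlretrieve(url, r"D:\image-1.jpg")  # fails: the URL is now bytes, not str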

This just raises another error:

TypeError: cannot use a string pattern on a bytes-like object

Then I tried urllib.parse.quote(url):

import urllib.parse
import urllib.request
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.quote(url)  # escapes the ":" after the scheme as well
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

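Again, this raised another error:

ValueError: unknown url type: 'http%3A//example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png'

The : in "http://..." was escaped as well, and I think that is the cause of the problem.

So I came up with a workaround: I quote/escape only the path, not the whole URL.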
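The escaping of the colon can be checked without any network access. The sketch below assumes Python 3 and writes İ as the escape \u0130. It shows the default behaviour of quote(), and that adding ":" to its safe parameter leaves the scheme intact; a URL with a query string would need further safe characters such as "?", "=" and "&".

```python
from urllib.parse import quote

url = "http://example.com/wp-content/uploads/2018/09/\u0130MAGE-1.png"

# By default only "/" is treated as safe, so ":" becomes %3A.
escaped_all = quote(url)

# Marking ":" as safe too keeps "http://" recognisable as a scheme.
kept_colon = quote(url, safe=":/")

print(escaped_all)
print(kept_colon)
```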

import urllib.request
import urllib.parse
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.urlparse(url)
url = url.scheme + "://" + url.netloc + urllib.parse.quote(url.path)
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

The URL now looks like this: "http://example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png", and I can download the image.
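The same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the name quote_url_path is hypothetical, Python 3 is used, and the choice of safe characters is a judgment call. Unlike concatenating scheme, netloc and path by hand, it keeps the query string and fragment. It does not handle non-ASCII host names or fragments.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url_path(url):
    """Percent-encode the path and query of url, leaving scheme and host alone.

    Hypothetical helper for illustration; not part of urllib.
    """
    parts = urlsplit(url)
    # "%" is kept safe so already-escaped sequences are not escaped twice.
    path = quote(parts.path, safe="/%")
    # Keep the usual query delimiters readable (an assumed choice).
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


image_url = "http://example.com/wp-content/uploads/2018/09/\u0130MAGE-1.png"
print(quote_url_path(image_url))
```

The result can then be passed to urllib.request.urlretrieve() in place of the original string.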

Can't open Unicode URL with Python
