Python URL decoding of %E3

Time: 2014-12-19 06:59:53

Tags: python encoding character-encoding urllib urldecode

I got some Wikipedia URLs from a Freebase dump:

URL 1: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa

URL 2: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brand%E3o_Costa

Both refer to the same Wikipedia page:

URL 3: http://pt.wikipedia.org/wiki/Pedro_Miguel_de_Castro_Brandão_Costa

urllib.unquote works for URL 1 (applied twice):

import urllib

url = 'Pedro_Miguel_de_Castro_Brand%25C3%25A3o_Costa'
url = urllib.unquote(url)  # first pass: %25C3%25A3 -> %C3%A3
url = urllib.unquote(url)  # second pass: %C3%A3 -> the UTF-8 bytes for "ã"
print url

The result is:

Pedro_Miguel_de_Castro_Brandão_Costa

But it does not work for URL 2:

import urllib

url = 'Pedro_Miguel_de_Castro_Brand%E3o_Costa'
url = urllib.unquote(url)
print url

The result is:

Pedro_Miguel_de_Castro_Brand�o_Costa    

What is going wrong?

1 answer:

Answer 0 (score: 4)

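The former is doubly percent-encoded UTF-8, and it prints correctly because your terminal uses UTF-8. The latter is percent-encoded Latin-1: unquoting yields the single byte 0xE3, which is not valid UTF-8 on its own, so the result has to be decoded from Latin-1 first.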

>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'
Pedro_Miguel_de_Castro_Brand�o_Costa
>>> print 'Pedro_Miguel_de_Castro_Brand\xe3o_Costa'.decode('latin-1')
Pedro_Miguel_de_Castro_Brandão_Costa