Question

String encodings and formats always throw me off.

This is what I have:

'ไทย'

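which I believe is UTF-8, and

'xn--o3cw4h'

which should be the same thing in IDNA encoding. However, I can't figure out how to get Python to convert from one to the other.

I am just trying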

a = u'xn--o3cw4h'
b = a.encode('idna')
b.decode('utf-8')

but I get exactly the same string back ('xn--o3cw4h', although it is no longer unicode). I am currently using Python 3.5.

Answer 1

To convert from one encoding to another, you must first decode the bytes to Unicode and then encode the result in the target encoding.

So, for example:

idna_encoded_bytes = b'xn--o3cw4h'
unicode_string = idna_encoded_bytes.decode('idna')
utf8_encoded_bytes = unicode_string.encode('utf-8')

print (repr(idna_encoded_bytes))
print (repr(utf8_encoded_bytes))
print (repr(unicode_string))

Python 2 result:

'xn--o3cw4h'
'\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2'
u'\u0e44\u0e17\u0e22'

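As you can see, the first line is the IDNA encoding of ไทย, the second line is its UTF-8 encoding, and the last line is the unencoded sequence of Unicode code points U+0E44, U+0E17, and U+0E22.

To do the conversion in one step, just chain the operations: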

utf8_encoded_bytes = idna_encoded_bytes.decode('idna').encode('utf8')
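The reverse direction (UTF-8 to IDNA) follows the same decode-then-encode pattern. Here is a minimal sketch, assuming Python 3; the function names `utf8_to_idna` and `idna_to_utf8` are made up for illustration and are not part of any library:

```python
def utf8_to_idna(utf8_bytes):
    """Decode UTF-8 bytes to text, then encode that text as IDNA."""
    return utf8_bytes.decode('utf-8').encode('idna')


def idna_to_utf8(idna_bytes):
    """Decode IDNA bytes to text, then encode that text as UTF-8."""
    return idna_bytes.decode('idna').encode('utf-8')


thai_utf8 = 'ไทย'.encode('utf-8')
print(utf8_to_idna(thai_utf8))        # b'xn--o3cw4h'
print(idna_to_utf8(b'xn--o3cw4h'))    # b'\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2'
```

Both helpers go through a Unicode `str` in the middle, which is the step the attempt in the question skips.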

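In response to a comment:

What I start with isn't b'xn--o3cw4h' but just the string 'xn--o3cw4h' [in Python 3].

You have an odd duck there. You have apparently encoded data stored in a unicode string. We need to convert it to a bytes object somehow. One easy way is to use the ASCII encoding (confusing as that may sound):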

improperly_encoded_idna = 'xn--o3cw4h'
idna_encoded_bytes = improperly_encoded_idna.encode('ascii')
unicode_string = idna_encoded_bytes.decode('idna')
utf8_encoded_bytes = unicode_string.encode('utf-8')

print (repr(idna_encoded_bytes))
print (repr(utf8_encoded_bytes))
print (repr(unicode_string))
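If the input may arrive either as `str` or as `bytes`, the two cases can be folded into one helper. This is only a sketch, assuming Python 3; `idna_to_text` is a hypothetical name chosen for this example:

```python
def idna_to_text(value):
    """Return the Unicode text for an IDNA label given as str or bytes."""
    if isinstance(value, str):
        # IDNA labels are plain ASCII, so this only changes the type.
        value = value.encode('ascii')
    return value.decode('idna')


text = idna_to_text('xn--o3cw4h')
print(text)                   # ไทย
print(text.encode('utf-8'))   # b'\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2'
```

Encoding with 'ascii' rather than something more permissive means a string that is not a valid IDNA label (one containing non-ASCII characters) fails early with a `UnicodeEncodeError`.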

String encoding IDNA -> UTF-8 (Python)
