How to convert a unicode string with cp1252 characters to UTF-8 using Python?

Time: 2017-07-25 01:43:10

Tags: python unicode encoding utf-8 cp1252

I am getting text through an API that returns characters with Windows-encoded apostrophes (\x92):

> python
>>> title = u'There\x92s thirty days in June'
>>> title
u'There\x92s thirty days in June'
>>> print title
Theres thirty days in June
>>> type(title)
<type 'unicode'>

I am trying to convert this string to UTF-8 so that it returns: "There’s thirty days in June"

When I try to decode or encode this unicode string, it throws an error:

>>> title.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\x92' in position 5: ordinal not in range(128)

>>> title.encode("cp1252").decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\x92' in position 5: character maps to <undefined>

If I initialize the string as a plain byte string and then decode it, it works:

>>> title = 'There\x92s thirty days in June'
>>> type(title)
<type 'str'>
>>> print title.decode('cp1252')
There’s thirty days in June

My question is: how do I convert the unicode string I am receiving into a plain byte string so that I can decode it?

1 answer:

Answer 0: (score: 4)

Your string appears to have been decoded with latin1 (since it is of type unicode).

  1. To convert it back to the bytes it originally was, you need to encode the unicode string using that encoding (latin1).
  2. Then, to get text back (unicode), you must decode those bytes using the proper codec (cp1252).
  3. Finally, if you want UTF-8 bytes, you must encode the result using the UTF-8 codec.

In code:

    >>> title = u'There\x92s thirty days in June'
    >>> title.encode('latin1')
    'There\x92s thirty days in June'
    >>> title.encode('latin1').decode('cp1252')
    u'There\u2019s thirty days in June'
    >>> print(title.encode('latin1').decode('cp1252'))
    There’s thirty days in June
    >>> title.encode('latin1').decode('cp1252').encode('UTF-8')
    'There\xe2\x80\x99s thirty days in June'
    >>> print(title.encode('latin1').decode('cp1252').encode('UTF-8'))
    There’s thirty days in June

Depending on whether the API takes text (unicode) or bytes, step 3 may not be necessary.
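The encode/decode round trip described above could be wrapped in a small helper, roughly as sketched below. The name `repair_cp1252_text` is hypothetical (it is not part of any library), and the sketch assumes the incoming text really is cp1252 data that was mis-decoded as latin1. It is written so that it should behave the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def repair_cp1252_text(text):
    """Hypothetical helper: undo a latin1 mis-decode of cp1252 bytes.

    Assumes `text` is a unicode string whose code points are all
    below U+0100, i.e. each one stands for a single original byte.
    """
    # Step 1: latin1 maps code points 0-255 straight back to bytes 0-255.
    raw_bytes = text.encode('latin1')
    # Step 2: decode those bytes with the codec the data was really in.
    return raw_bytes.decode('cp1252')


title = u'There\x92s thirty days in June'
fixed = repair_cp1252_text(title)

# Step 3 (only needed if the consumer wants bytes rather than text).
utf8_bytes = fixed.encode('utf-8')
```

Note that `encode('latin1')` raises `UnicodeEncodeError` if the text contains characters above U+00FF, and `decode('cp1252')` raises `UnicodeDecodeError` for the few byte values cp1252 leaves undefined (such as `\x81`), so real input may need an `errors=` argument or a try/except around the call.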