'UCS-2' codec can't encode characters

Time: 2017-07-07 07:36:29

Tags: python-3.x utf-8

I am trying to read a text file, but it throws an error.

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 12416-12416: Non-BMP character not supported in Tk

I also tried to ignore the error, but that did not work. Here is the code:

import io

with io.open('reviews1.txt', mode='r', encoding='utf-8') as myfile:
    document1 = myfile.read().replace('\n', '')
print(document1)

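2 answers:

Answer 0: (score: 0)

The problem is not in reading the file (that would be a decoding error). It is in the print call: your environment apparently cannot handle characters outside the BMP, such as emoji.

If you want to print these characters to STDOUT, check whether your shell/IDE supports an encoding that covers all of Unicode (UTF-8, UTF-16, ...). Alternatively, switch to another environment to run the script.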
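To see which characters are affected before printing, the text can be scanned for code points above U+FFFF. The following is only a minimal sketch: the helper name `non_bmp_positions` and the sample string are hypothetical and not part of the original answer. The reported `sys.stdout.encoding` is just a hint, since the error message in the question points at Tk rather than at the stream encoding.

```python
import sys


def non_bmp_positions(text):
    """Return (index, character) pairs for every character above U+FFFF."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]


if __name__ == "__main__":
    # Hypothetical sample text containing two emoji (both outside the BMP).
    sample = "good \U0001F600 bad \U0001F620"
    # Encoding that print() will use; can be None for some replaced streams.
    print("stdout encoding:", sys.stdout.encoding)
    for index, ch in non_bmp_positions(sample):
        # ascii() escapes the character, so this line is safe to print anywhere.
        print(index, "U+%06X" % ord(ch), ascii(ch))
```

An empty result means the text contains nothing outside the BMP, so this particular error should not be triggered by it.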

If you want to run it in the same setup, you can encode the data yourself, which lets you specify custom error handling:

import sys

data = document1.encode('UCS-2', errors='replace')
sys.stdout.buffer.write(data)

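This replaces unsupported characters with ? or a similar placeholder character. You can also specify errors='ignore', which drops those characters.

However, I cannot test this, because my codec library does not know the UCS-2 encoding. It is an obsolete standard that Windows used up to NT.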
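Because the UCS-2 codec may not be available, the same effect can be approximated by filtering the string before printing it. This is a sketch of a different technique (a regular expression over code points), not the approach from the answer above; the name `replace_non_bmp` is hypothetical.

```python
import re

# Matches any single code point outside the Basic Multilingual Plane.
_NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def replace_non_bmp(text, replacement="\ufffd"):
    """Return text with every non-BMP character swapped for `replacement`.

    Passing replacement="" drops the characters, similar to errors='ignore'.
    """
    # A function is used as the substitution so that backslashes in
    # `replacement` are never interpreted as regex escapes.
    return _NON_BMP.sub(lambda match: replacement, text)
```

With this helper, `print(replace_non_bmp(document1))` would print the document with U+FFFD in place of each emoji, and `replace_non_bmp(document1, "?")` would mimic the `errors='replace'` behaviour described above.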

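Answer 1: (score: -1)

I can reproduce the error in the Python IDLE environment (Python version 3.5.1, Tk version 8.6.4, IDLE version 3.5.1). This appears to be a bug in Tk. However, the original script runs smoothly from the console (Windows cmd, in my case): Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32

The only workaround I can see may be very slow: the commented script below copies the whole document character by character, leaving out everything outside the Basic Multilingual Plane (each such character is replaced with U+FFFD).

Edit: I found this (more Python-ish) solution (thanks to Mark Ransom). Unfortunately, it works in the Python shell, but the Python console complains: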

>>> print( ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack(
...   '>2H', c.encode('utf-16be'))) for c in document1)
... )
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python\Python35\lib\site-packages\win_unicode_console\streams.py", line 179, in write
    return self.base.write(s)
UnicodeEncodeError: 'utf-16-le' codec can't encode character '\ud83d' in position 0: surrogates not allowed
>>>
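The one-liner above turns each non-BMP character into its two UTF-16 surrogate code units, which is why a strict UTF-16 encoder then rejects the result. The sketch below spells out the same conversion with explicit arithmetic instead of `struct.unpack`; the function name `to_surrogate_pairs` is hypothetical and used only for illustration.

```python
def to_surrogate_pairs(text):
    """Replace each character above U+FFFF with its UTF-16 surrogate pair."""
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFFFF:
            # BMP characters are copied unchanged.
            pieces.append(ch)
        else:
            # Standard UTF-16 split: 20 bits divided into two 10-bit halves.
            offset = code_point - 0x10000
            pieces.append(chr(0xD800 + (offset >> 10)))    # high surrogate
            pieces.append(chr(0xDC00 + (offset & 0x3FF)))  # low surrogate
    return "".join(pieces)
```

For U+1F600 this yields '\ud83d\ude00', matching the '\ud83d' reported in the traceback. Encoding such a string with the default strict error handler raises UnicodeEncodeError, while `errors='surrogatepass'` lets the pair through and decoding the bytes as UTF-16-LE restores the original character.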


# -*- coding: utf-8 -*-

import sys, io
import os, codecs                       # for debugging

print(os.path.basename(sys.executable), sys.argv[0], '\n') # for debugging

#######################
### original answer ###
#######################
filepath = 'D:\\test\\reviews1.txt'
with io.open(filepath, mode='r',encoding='utf-8') as myfile:
    document1=myfile.read() #.replace('\n', '')
    document2=u''
    for character in document1:
        ordchar = ord(character)
        if ordchar <= 0xFFFF:
            # debugging # print( 'U+%.4X' % ordchar, character)
            document2+=character
        else:
            # debugging # print( 'U+%.6X' % ordchar, '�')
            ###         �=Replacement Character; codepoint=U+FFFD; utf8=0xEFBFBD
            document2+='�'
print(document2)                        # original answer, runs universally

######################
### updated answer ###
######################
if os.path.basename(sys.executable) == 'pythonw.exe':    
    import struct
    document3 = ''.join(c if c <= '\uffff' else ''.join(chr(x) for x in struct.unpack('>2H', c.encode('utf-16be'))) for c in document1)
    print(document3)                    # Pythonw shell
else:
    print(document1)                    # Python console

Output, Pythonw shell:

================== RESTART: D:/test/Python/Py/q44965129a.py ==================
pythonw.exe D:/test/Python/Py/q44965129a.py 

� smiling face with smiling eyes �
� smiling face with open mouth   �
� angry face                     �

 smiling face with smiling eyes 
 smiling face with open mouth   
 angry face                     

>>>

Output, Python console:

==> D:\test\Python\Py\q44965129a.py
python.exe D:\test\Python\Py\q44965129a.py

� smiling face with smiling eyes �
� smiling face with open mouth   �
� angry face                     �

 smiling face with smiling eyes 
 smiling face with open mouth   
 angry face                     

==>