Question

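I have a file encoded in UTF-16, and I want to convert it to UTF-8 and remove the BOM. The code below handles the conversion, but I can't figure out the most efficient way to remove the BOM.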

import codecs
import chardet

def convert_to_utf8(event):
    blocksize = 1048576
    output_file = add_timestamp(event.pathname)
    with open(event.pathname, 'r') as char_set:
        enc = chardet.detect(char_set.read(blocksize))['encoding']
        print enc

    with codecs.open(event.pathname, 'rb', encoding=enc) as encoded_file:
        with codecs.open(output_file, "w+b", encoding='utf-8') as utf8_file:
            while True:
                content = encoded_file.read(blocksize)
                if not content:
                    break
                #if content.startswith(codecs.BOM_UTF8):
                #    content.replace(codecs.BOM_UTF8, '')
                utf8_file.write(content)

This is the initial file:

$ file test_16.csv -bi
text/plain; charset=utf-16le

This is the resulting file:

$ file -bi test_16-1390343202.csv
text/plain; charset=utf-8

This is how I check for the BOM:

>>> with open('test_16-1390343202.csv', 'rb') as f:
...     repr(f.readline())

"'\\xef\\xbb\\xbfFOO,BAR,BAZ\\r\\n'"

Answer 1

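You had the right idea with the commented-out code; it just needs a small tweak. Once the BOM has been read through a codec, it is no longer the 3-byte UTF-8 sequence, or even a UTF-16 code unit; it is just the single Unicode character U+FEFF.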

if content[0] == U'\uFEFF':
    content = content[1:]
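
To illustrate why the comparison is against U+FEFF rather than codecs.BOM_UTF8, here is a small check. The sample bytes are made up to mirror the question's file, and the utf-16-le codec name is an assumption based on the `file` output in the question.

```python
import codecs

# Hypothetical sample: a UTF-16LE BOM followed by "FOO" encoded as UTF-16LE.
raw = codecs.BOM_UTF16_LE + u"FOO".encode("utf-16-le")

# The explicit little-endian codec does not consume the BOM;
# it comes through as one ordinary character, U+FEFF.
text = raw.decode("utf-16-le")
has_bom = text[0] == u"\ufeff"

# Encoding that character as UTF-8 yields the three bytes seen in the output file.
utf8_bytes = text.encode("utf-8")
starts_with_utf8_bom = utf8_bytes.startswith(codecs.BOM_UTF8)
```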

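Also note that the replace call would not work as written, because it does not replace in place; it can't, since strings in Python are immutable. You would have to assign the result back to the variable. Since we know the BOM is just a single character, I simplified it by using a slice.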
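
Putting this together, a minimal sketch of the conversion loop might look like the following. The function name `convert_without_bom`, its parameters, and the `first_block` flag are hypothetical and not from the question; it also uses `io.open` instead of `codecs.open`, which is a substitution on my part, and it expects the caller to pass the source encoding (for example the value chardet reported).

```python
import io

def convert_without_bom(src_path, dst_path, src_encoding, blocksize=1048576):
    """Copy src_path to dst_path as UTF-8, dropping a leading U+FEFF if present."""
    first_block = True
    # newline="" keeps line endings such as \r\n exactly as they are in the source.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            while True:
                content = src.read(blocksize)
                if not content:
                    break
                # Only the very first block can start with the BOM.
                if first_block and content.startswith(u"\ufeff"):
                    content = content[1:]
                first_block = False
                dst.write(content)
```

Tracking `first_block` matters because the two-line check on its own would run for every block, so it would also drop a U+FEFF that happened to begin a later block.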

Answer 2

Read a single character before the loop. If it is not the BOM, write it out; otherwise discard it.
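
A rough sketch of that approach, assuming the source encoding is already known. The name `copy_without_bom` and its parameters are made up for illustration, and `io.open` is used here rather than the question's `codecs.open`.

```python
import io

def copy_without_bom(src_path, dst_path, src_encoding, blocksize=1048576):
    """Copy src_path to dst_path as UTF-8, skipping a leading U+FEFF."""
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            # Read exactly one character up front; keep it unless it is the BOM.
            first = src.read(1)
            if first != u"\ufeff":
                dst.write(first)
            # Copy the rest block by block; iter() stops at the empty string.
            for content in iter(lambda: src.read(blocksize), u""):
                dst.write(content)
```

This keeps the BOM test out of the main loop entirely, so the loop body is just a plain copy.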

Unable to convert a file from UTF-16 to UTF-8 and remove the BOM
