Question

I am recursing through folders and collecting document names, plus some other data, to load into a database.

import os
text_file = open("Output.txt", "w")

dirName = 'D:\\'
for nextDir, subDir, fileList in os.walk(dirName):
    for fname in fileList: 
        text_file.write(fname + '\n')

The problem is that some document names contain foreign characters, such as:

RC-0964_1000 Tưởng thưởng Diamond trẻ nhất Việt Nam - Đặng Việt Thắng và Trần Thu Phương

and

RC-1046 安麗2013ARTISTRY冰上雅姿盛典-愛里歐娜．薩維琴科_羅賓．索爾科維【Suit & Tie】.mp4

The code above raises this error on the last line:

UnicodeEncodeError: 'charmap' codec can't encode characters in position ##-##: character maps to <undefined>

I have tried:

temp = fname.endcode(utf-8)
temp = fname.decode(utf-8)
temp = fname.encode('ascii','ignore')
temp2 = temp.decode('ascii')
temp =unicode(fname).encode('utf8')

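How should this script be written so that all characters are written to the file? Do I need to change the file I am writing to, or the string I am writing, and how?

These names can be pasted into the file without any problem, so why won't Python write them?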

Answer 1

Since this is Python 3, choose an encoding that supports all of Unicode. On Windows, at least, the default is locale-dependent, such as cp1252, and will fail for characters such as Chinese.
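As a minimal sketch of this advice, the snippet below passes an explicit encoding to open() instead of relying on the locale default. The helper names iter_file_names and write_names are made up for illustration and are not from the question; the sample name is shortened from the one the asker posted.

```python
import os
import tempfile


def iter_file_names(root):
    """Yield the name of every file found under root (hypothetical helper)."""
    for _dirpath, _dirnames, file_names in os.walk(root):
        yield from file_names


def write_names(names, out_path):
    """Write one name per line as UTF-8, which can encode any Unicode text."""
    with open(out_path, "w", encoding="utf-8") as text_file:
        for name in names:
            text_file.write(name + "\n")


# Demo: write a name from the question to a temporary file and read it back.
sample_names = ["RC-0964_1000 Tưởng thưởng Diamond trẻ nhất Việt Nam"]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "Output.txt")
    write_names(sample_names, out_path)
    with open(out_path, encoding="utf-8") as f:
        written = f.read().splitlines()
```

For the asker's case the call would look like write_names(iter_file_names("D:\\"), "Output.txt"). Reading the file back needs the same encoding that was used to write it.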

Answer 2

By default, text_file uses locale.getpreferredencoding(False) (the Windows ANSI code page in your case).
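To see which encoding applies on a given machine, something like the following can help. It is only a sketch: the file name probe.txt is arbitrary, and the reported values depend on the platform, the locale, and the Python version (newer releases and UTF-8 mode may report UTF-8 even on Windows).

```python
import locale
import os
import tempfile

# The locale's preferred encoding, which text-mode open() has traditionally
# used when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print("preferred encoding:", default_encoding)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.txt")

    # No encoding argument: the file object reports whatever default applied.
    with open(path, "w") as implicit:
        implicit_encoding = implicit.encoding

    # Explicit encoding: independent of the locale.
    with open(path, "w", encoding="utf-8") as explicit:
        explicit_encoding = explicit.encoding

print("open() without encoding:", implicit_encoding)
print("open() with encoding='utf-8':", explicit_encoding)
```

If the first two printed values are something like cp1252, names containing Vietnamese or Chinese characters cannot be written without an explicit encoding.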

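If the input path is Unicode on Windows, os.walk() uses the Unicode API, so it may produce names that cannot be represented in a Windows code page such as cp1252. That is what causes the UnicodeEncodeError: 'charmap' codec can't encode error. An 8-bit encoding such as cp1252 can represent at most 256 characters, whereas Unicode defines more than a million code points.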
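The limitation can be reproduced without touching the file system. The sketch below tries to encode a few strings with cp1252 and with utf-8; the helper can_encode is hypothetical, and the sample strings are fragments of the names in the question plus one Western European word for contrast.

```python
def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


names = [
    "Đặng Việt Thắng",    # Vietnamese letters outside cp1252
    "安麗2013ARTISTRY",   # Chinese characters outside cp1252
    "café",               # é is part of cp1252
]

results = {
    name: (can_encode(name, "cp1252"), can_encode(name, "utf-8"))
    for name in names
}

for name, (in_cp1252, in_utf8) in results.items():
    # ascii() keeps the output printable even on a console with a limited encoding.
    print(ascii(name), "cp1252:", in_cp1252, "utf-8:", in_utf8)
```

Only the last string fits in cp1252, while utf-8 handles all three.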

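To fix it, use a character encoding that can represent the given names. The utf-8 and utf-16 encodings can represent all Unicode characters. You may prefer utf-16 on Windows so that notepad.exe displays the file correctly: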

with open('output.txt', 'w', encoding='utf-16') as text_file:
    print('\N{VICTORY HAND}', file=text_file)
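A quick round-trip check, sketched under the assumption that a temporary file is acceptable for testing, confirms that nothing is lost. It also shows the two-byte byte order mark that Python's utf-16 codec writes at the start of the file, which lets other programs detect the encoding. The name is shortened from the one in the question.

```python
import os
import tempfile

name = "RC-1046 安麗2013ARTISTRY冰上雅姿盛典"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.txt")

    # Write the name as UTF-16; the codec prepends a byte order mark (BOM).
    with open(path, "w", encoding="utf-16") as text_file:
        print(name, file=text_file)

    # Inspect the first two raw bytes: FF FE (little endian) or FE FF (big endian).
    with open(path, "rb") as raw:
        bom = raw.read(2)

    # Decode with the same encoding; the BOM is consumed automatically.
    with open(path, encoding="utf-16") as text_file:
        read_back = text_file.read().rstrip("\n")

print("BOM bytes:", bom.hex())
print("round trip ok:", read_back == name)
```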

How to write characters from foreign encodings to a text file
