Question

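For a few days now I have been stuck on this annoying file encoding problem in a small Python program of mine.

I work a lot with MediaWiki; recently I have been converting documents from .doc to Wikisource.

A document in Microsoft Word format is opened in LibreOffice and then exported to a .txt file with Wikisource formatting. My program searches for [[Image:]] tags and replaces them with image names taken from a list. That mechanism works really well (thanks a lot for the help, brjaga!). When I ran some tests on .txt files I had created myself, everything worked fine, but when I fed it a .txt file with Wikisource content, the whole thing stopped being so funny :D

I got this message from Python: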

Traceback (most recent call last):
  File "C:\Python33\final.py", line 15, in <module>
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
  File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>

Here is my Python code:

li = [
    "[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
    "[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
    ]


with open ("C:\\124_BPP_PL_PL.txt") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')

for item in li:
     s = s.replace("[[Image:]]", item, 1)

dest.write(s)
dest.close()

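OK, so I did some research and found that this is an encoding problem. So I installed Notepad++, changed the encoding of my Wikisource .txt file to UTF-8, and saved it. Then I made a small change to my code:

with open("C:\\124_BPP_PL_PL.txt", encoding="utf8") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

But I got this new error message: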

Traceback (most recent call last):
  File "C:\Python33\final.py", line 22, in <module>
    dest.write(s)
  File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

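And I'm really stuck on this one. I thought that once I changed the encoding manually in Notepad++ and then told Python which encoding I had set, everything would be fine.

Please help. Thanks in advance.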

Answer 1

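When Python 3 opens a text file, it uses the system's default encoding (cp1250 here, as the tracebacks show) to decode the file, so that it can give you full Unicode text (the str type is fully Unicode-aware). The same applies when writing such Unicode text values out.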
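To see which default is in play on a given machine, a minimal sketch like the following may help. It assumes that open() falls back to locale.getpreferredencoding(False) when no encoding is passed, which is how the Python 3 documentation describes it; the helper names default_text_encoding and can_decode are made up for this illustration.

```python
import locale


def default_text_encoding():
    """Return the encoding open() is documented to use when none is given."""
    return locale.getpreferredencoding(False)


def can_decode(data, encoding):
    """Report whether the given bytes decode cleanly with the given codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(default_text_encoding())

# Byte 0x81 has no mapping in cp1250, which is what the first traceback
# in the question complains about.
print(can_decode(b"\x81", "cp1250"))
```

On the asker's system the first print would presumably show cp1250, matching the codec named in both tracebacks.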

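You have already solved the input side: you specified an encoding when reading. Do the same when writing: specify a codec for the output file that can handle Unicode, including the zero-width no-break space character at code point U+FEFF (at position 0 of your text it is acting as a byte order mark). UTF-8 is usually a good default choice: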

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')
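As a side note, the U+FEFF at position 0 can also be dropped at read time instead of being carried through to the output. The sketch below is an optional variation, not part of the fix above: it assumes the file was saved with a leading UTF-8 BOM and uses Python's standard utf-8-sig codec, which strips that BOM when decoding. The sample file and the read_text helper are invented for the demonstration.

```python
import os
import tempfile

BOM = "\ufeff"


def read_text(path, encoding):
    """Read a whole text file with an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Build a small UTF-8 file that starts with a BOM (hypothetical sample data).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(BOM.encode("utf-8") + "[[Image:]] zażółć".encode("utf-8"))

with_bom = read_text(sample_path, "utf-8")         # BOM stays in the text
without_bom = read_text(sample_path, "utf-8-sig")  # BOM is stripped
os.remove(sample_path)

print(with_bom.startswith(BOM))
print(without_bom.startswith(BOM))

# cp1250 has no mapping for U+FEFF, hence the UnicodeEncodeError on write.
try:
    BOM.encode("cp1250")
    cp1250_can_encode_bom = True
except UnicodeEncodeError:
    cp1250_can_encode_bom = False
print(cp1250_can_encode_bom)
```

Either way, the output file still needs an explicit encoding, because the rest of the text may contain characters that cp1250 cannot represent.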

You can also use the with statement when writing and save yourself the .close() call:

for item in li:
    s = s.replace("[[Image:]]", item, 1)

with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:
    dest.write(s)
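Putting both halves together, the whole job might be sketched as below. This is only an illustration under stated assumptions: fill_image_tags and process_file are hypothetical helper names, the input is read with utf-8-sig so that a leading BOM (if any) is dropped, and the demo runs on a throwaway file in a temporary directory rather than on the real C:\\124_BPP_PL_PL.txt.

```python
import os
import tempfile


def fill_image_tags(text, image_tags, placeholder="[[Image:]]"):
    """Replace each empty placeholder, in order, with the next image tag."""
    for tag in image_tags:
        text = text.replace(placeholder, tag, 1)
    return text


def process_file(src_path, dest_path, image_tags):
    """Read src_path, fill in the image tags, and write dest_path as UTF-8."""
    # Explicit encodings on both sides, so the system default is never used.
    with open(src_path, encoding="utf-8-sig") as src:
        text = " ".join(line.rstrip("\n") for line in src)
    with open(dest_path, "w", encoding="utf-8") as dest:
        dest.write(fill_image_tags(text, image_tags))


# Demo on throwaway files (made-up sample content).
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.txt")
    dest_path = os.path.join(tmp, "out.txt")
    with open(src_path, "w", encoding="utf-8-sig") as f:
        f.write("a [[Image:]]\nb [[Image:]]\n")
    process_file(src_path, dest_path, ["[[Image:one.jpg]]", "[[Image:two.jpg]]"])
    with open(dest_path, encoding="utf-8") as f:
        demo_result = f.read()

print(demo_result)
```

Keeping the replacement logic in its own function makes it easy to test on plain strings, separately from any file encoding concerns.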
