Question

解决：问题与Python版本有关，请参阅stackoverflow.com/a/5513856/2540382

我正在摆弄htm -> txt文件转换并遇到一些麻烦。我的项目主要是将我下载的Facebook聊天记录的messages.htm文件转换为messages.txt文件，删除所有<>括号并保留格式。

文件messages.htm被解析为变量text。

然后我跑了：

target = open('output.txt', 'w')
target.write(text)
target.close

除非遇到无效字符，否则这似乎有效。如下面的错误所示。有没有办法：

写作时用无效字符跳过这一行？
找出无效字符的位置并删除相应的字符或行？

期望的结果是尽可能避免将奇怪的字符放在一起。

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U000fe333' in position 37524: character
maps to <undefined>

Answer 1

target = open('output.txt', 'wb')
target.write(text.encode('ascii', 'ignore'))
target.close()

对于.encode（..）的“errors”参数，'ignore'将删除这些字符，'replace'将用'？'替换它们。

为了测试这个，我用

替换了写行

target.write(u"foo\U000fe333bar".encode("ascii", "ignore"))

并确认output.txt仅包含“foobar”。

更新：我将open(.., 'w')编辑为open(.., 'wb')，以确保此功能在Python 3中也能正常使用。

Python写入文本文件跳过坏行

1 个答案: