Question

I am trying to read from the rockyou wordlist and write all words that are >= 8 characters long to a new file.

Here is the code:

def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w') as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            print(line, file = out_file, end = '')
        print("done")

if __name__ == '__main__':
    main()

Some words are not utf-8.

Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 6, in main
    print(line, file = out_file, end = '')
  File "C:\Python\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0e45' in position 0: character maps to <undefined>

Update

def main():
    with open("rockyou.txt", encoding="utf8") as in_file, open('rockout.txt', 'w', encoding="utf8") as out_file:
        for line in in_file:
            if len(line.rstrip()) < 8:
                continue
            out_file.write(line)
        print("done")

if __name__ == '__main__':
    main()

Traceback (most recent call last):
  File "wpa_rock.py", line 10, in <module>
    main()
  File "wpa_rock.py", line 3, in main
    for line in in_file:
  File "C:\Python\lib\codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 933: invalid continuation byte

Answer 1

The UnicodeEncodeError: 'charmap' error happens while writing to out_file (in print()).

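By default, open() uses locale.getpreferredencoding(), which on Windows is an ANSI code page (such as cp1252). That code page can't represent all Unicode characters, and the '\u0e45' character in particular. cp1252 is a one-byte encoding that can represent at most 256 different characters, but there are over a million Unicode code points (the highest is U+10FFFF, i.e. 1114111). It can't represent them all.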
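A quick way to see this limitation is to encode the character from the traceback with both codecs. This is only a minimal sketch; the helper name can_encode is made up for illustration and is not part of any library.

```python
def can_encode(text, encoding):
    """Return True if every character in text can be represented in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


thai_char = "\u0e45"  # the character named in the traceback

print(can_encode(thai_char, "cp1252"))  # cp1252 has no mapping for this character
print(can_encode(thai_char, "utf-8"))   # utf-8 can represent it
```

The same check can be run on any string before choosing an output encoding.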

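Pass an encoding that can represent all of the desired data; for example, encoding='utf-8' must work (as @robyschek suggested). If your code reads utf-8 data without any errors, then it should be able to write that data using utf-8 too.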

The UnicodeDecodeError: 'utf-8' error happens while reading in_file (in for line in in_file). Not all byte sequences are valid utf-8; for example, os.urandom(100).decode('utf-8') may fail. What to do depends on the application.
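The failure can be reproduced without the wordlist. The sketch below uses a made-up byte string containing the same 0xf1 byte that the traceback reports, and inspects the exception to find where decoding stopped.

```python
# Hypothetical sample: 0xf1 opens a 4-byte UTF-8 sequence,
# but the following byte ("o") is not a continuation byte.
data = b"password\xf1o\n"

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    reason = exc.reason      # human-readable cause reported by the codec
    bad_offset = exc.start   # index of the first undecodable byte
    print(f"cannot decode byte {data[bad_offset]:#x} at offset {bad_offset}: {reason}")
```

The start attribute of the exception is a convenient way to locate the offending bytes in a large file.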

If you expect the file to be encoded as utf-8, you could pass the errors="ignore" parameter to open() to ignore occasional invalid byte sequences. Or you could use some other error handler, depending on your application.
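To make the choice of error handler concrete, here is a small sketch. The sample bytes and the function name filter_long_words are assumptions for illustration; the function is a variant of the loop from the question with the errors argument exposed.

```python
raw = b"contrase\xf1a\n"  # hypothetical line with one byte that is not valid UTF-8

ignored = raw.decode("utf-8", errors="ignore")            # silently drops the bad byte
replaced = raw.decode("utf-8", errors="replace")          # substitutes U+FFFD
escaped = raw.decode("utf-8", errors="backslashreplace")  # keeps it visible as \xf1


def filter_long_words(src, dst, min_len=8, errors="ignore"):
    """Copy lines of at least min_len characters from src to dst.

    Invalid UTF-8 in src is handled according to errors.
    """
    with open(src, encoding="utf-8", errors=errors) as in_file, \
         open(dst, "w", encoding="utf-8") as out_file:
        for line in in_file:
            if len(line.rstrip()) >= min_len:
                out_file.write(line)
```

Note that "ignore" changes the word (and possibly its length), so whether it is acceptable depends on what the output is used for.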

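If the actual character encoding used in the file is different, then you should pass the actual character encoding. bytes by themselves do not have any encoding; that metadata should come from another source (though some encodings are more likely than others: chardet can guess). For example, if the file content is an http body, see "A good way to get the charset/encoding of an HTTP response in Python".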

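Sometimes broken software generates mostly utf-8 byte sequences with some bytes in a different encoding. bs4.BeautifulSoup can handle some special cases. You could also try the ftfy utility/library and see whether it helps in your case; for example, ftfy may fix some utf-8 variations.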
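For a line-oriented file such as a wordlist, a much simpler standard-library approach than bs4 or ftfy is to read bytes and decode each line separately. The sketch below is not what those libraries do; the function name decode_line and the choice of cp1252 as the fallback are assumptions for illustration.

```python
def decode_line(raw, fallback="cp1252"):
    """Decode one line as UTF-8, falling back to another codec for that line only."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values cp1252 leaves undefined
        return raw.decode(fallback, errors="replace")


# Hypothetical mixed input: the first line is valid UTF-8, the second is not.
lines = [b"caf\xc3\xa9s-du-monde\n", b"contrase\xf1a\n"]
decoded = [decode_line(raw) for raw in lines]
print(decoded)
```

With a real file this would be used as `for raw in open(path, "rb")`, so that one badly encoded line does not affect the others.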

Answer 2

Hey, I had a similar problem. For the rockyou.txt wordlist, I tried a number of the encodings that Python has to offer and found that encoding = 'koi8_u' could read the file.
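Trying several encodings can be automated, with one caveat: single-byte codecs assign a character to nearly every byte value (latin-1 to all of them), so decoding without an error does not prove that the encoding is the right one. A minimal sketch, where the function name and the sample bytes are made up for illustration:

```python
def encodings_that_decode(data, candidates):
    """Return the candidate encodings that decode data without raising."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


sample = b"contrase\xf1a\n"  # hypothetical line, not taken from the real file
result = encodings_that_decode(sample, ["utf-8", "cp1252", "latin-1", "koi8_u"])
print(result)
```

Several candidates succeed on this sample, and each yields different text, so the decoded output still needs to be checked by eye.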

UnicodeEncodeError when reading a file
