Question

So I have read countless posts all over this forum and tried plenty of the suggestions, but I still can't get my code to do what I want, which is:

Use BeautifulSoup (WORKS)
Write that text to an output file (WORKS)
With the correct encoding (DOESN'T WORK).

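Printing the text to stdout works fine. But neither redirecting that output to output.txt nor writing it to that file directly from within Python seems to work. Why is that?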

The following works:

    print '$'.join(result).encode('utf-8')

Example output:

    Bärensteiner Str.

Neither this:

    myscript.py > output.txt

nor this:

    with codecs.open('output.txt', 'a', 'utf-8') as outfile:
      outfile.write('$'.join(result))

nor this:

    with open('output.txt', 'a') as outfile:
      outfile.write('$'.join(result).encode('utf-8'))

works. All three of the above produce an output.txt that contains the following:

    BÃ¤rensteiner Str.

I'm at a loss and (quite obviously) haven't properly grasped how encoding and decoding work... Anyway: does any of you clever people know how to get my code to work?

Answer 1

A blatant case of mojibake.

Your file is UTF-8, but it is being displayed using some other encoding, for example:

    ==> python -c "print('Bären'.encode('utf-8').decode('Latin1')=='BÃ¤ren')"
    True

    ==> python -c "print('Bären'=='BÃ¤ren'.encode('Latin1').decode('utf-8'))"
    True
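To check the same thing against an actual file, here is a minimal Python 3 sketch. It assumes a temporary file path (hypothetical, not the asker's real output.txt) and uses the built-in `open` with an explicit `encoding` argument. It writes the text as UTF-8, reads the raw bytes back, and decodes them twice: once as UTF-8 and once as Latin-1, roughly what a viewer with the wrong encoding setting would do.

```python
import os
import tempfile

text = 'Bärensteiner Str.'

# Hypothetical location: a fresh temporary directory, so nothing real is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# Write the text with an explicit encoding instead of relying on the platform default.
with open(path, 'w', encoding='utf-8') as outfile:
    outfile.write(text)

# Read the raw bytes back to see what is really stored on disk.
with open(path, 'rb') as infile:
    raw = infile.read()

as_utf8 = raw.decode('utf-8')      # how a UTF-8 aware viewer interprets the bytes
as_latin1 = raw.decode('latin-1')  # how a Latin-1 viewer interprets the same bytes

print(raw)                # the 'ä' is stored as the two bytes \xc3\xa4
print(as_utf8 == text)    # the file content itself is intact
print(as_latin1 == text)  # the wrong decoding shows 'Ã¤' instead of 'ä'
```

If the first comparison is true and the second is false, the bytes on disk are fine and the thing to change is most likely the encoding the editor or viewer uses to open output.txt.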

According to the Character Encoding Errors Analyzer:

Expected result: Bären

Actual result: BÃ¤ren

Showing 6 results:

    utf-8 (65001, Unicode (UTF-8)) -> windows-1252 (1252, Western European (Windows))
    utf-8 (65001, Unicode (UTF-8)) -> windows-1254 (1254, Turkish (Windows))
    utf-8 (65001, Unicode (UTF-8)) -> iso-8859-1 (28591, Western European (ISO))
    utf-8 (65001, Unicode (UTF-8)) -> iso-8859-4 (28594, Baltic (ISO))
    utf-8 (65001, Unicode (UTF-8)) -> iso-8859-9 (28599, Turkish (ISO))
    utf-8 (65001, Unicode (UTF-8)) -> utf-7 (65000, Unicode (UTF-7))
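The kind of search such an analyzer performs can be approximated in a few lines of Python 3. This is only a sketch under stated assumptions: the function name `encodings_explaining` and the candidate list are made up for illustration, and this is not how the analyzer itself is implemented. It encodes the expected text as UTF-8, decodes those bytes with each candidate codec, and keeps the codecs that reproduce the garbled text. UTF-7 is left out of the candidates here.

```python
def encodings_explaining(expected, actual, candidates):
    """Return the candidate codecs that turn the UTF-8 bytes of `expected` into `actual`."""
    utf8_bytes = expected.encode('utf-8')
    matches = []
    for name in candidates:
        try:
            decoded = utf8_bytes.decode(name)
        except (UnicodeDecodeError, LookupError):
            # The codec cannot decode these bytes, or is not available: skip it.
            continue
        if decoded == actual:
            matches.append(name)
    return matches


# Hypothetical candidate list: the single-byte code pages from the results above,
# plus two codecs that should not match (cp437 and ascii).
candidates = ['cp1252', 'cp1254', 'iso-8859-1', 'iso-8859-4', 'iso-8859-9',
              'cp437', 'ascii']

matches = encodings_explaining('Bären', 'BÃ¤ren', candidates)
print(matches)
```

Several code pages match because they agree on the two byte values involved (0xC3 and 0xA4), so the garbled text alone cannot tell you which of them the viewer actually used.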
