Question

I read lines from the following file:

The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)

Die virtuelle Katastrophe: So führen Sie Teams über Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)

I read/encode them with:

title = line.encode('utf8')

But the output is:

b'Die virtuelle Katastrophe: So f\xc3\xbchren Sie Teams \xc3\xbcber Distanz zur Spitzenleistung (German Edition) (Thomas, Gary)'

b'The Little Big Things: 163 Wege zur Spitzenleistung (Dein Leben) (German Edition) (Peters, Tom)'

Why is the "b" always added? How do I read the file correctly so that the umlauts are preserved?

Here is the full relevant code snippet:

# Parse the clippings.txt file
lines = [line.strip() for line in codecs.open(config['CLIPPINGS_FILE'], 'r', 'utf-8-sig')]
for line in lines:
    line_count = line_count + 1
    if (line_count == 1 or is_title == 1):
        # ASSERT: this is a title line
        #title = line.encode('ascii', 'ignore')
        title = line.encode('utf8')
        prev_title = 1
        is_title = 0
        note_type_result = note_type = l = l_result = location = ""
        continue

Thanks

Answer 1

The method str.encode converts a Unicode string into a bytes object:

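str.encode(encoding="utf-8", errors="strict")
Return an encoded version of the string as a bytes object. Default encoding is 'utf-8'. errors may be given to set a different error handling scheme. The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace' and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.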

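So you get exactly what is expected: the b prefix is just how Python displays a bytes object, and \xc3\xbc is the UTF-8 byte sequence for "ü". If you want text with the umlauts intact, do not call .encode() on the line; it is already a str.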
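To make this concrete, here is a minimal sketch of the round trip between str and bytes. The sample string is a shortened piece of one of the titles from the question, and the variable names are only illustrative.

```python
# A str holds text; the umlauts are ordinary characters here.
title = "So führen Sie Teams über Distanz"

# .encode() turns the text into raw bytes; "ü" becomes the two bytes c3 bc.
encoded = title.encode("utf-8")
print(type(encoded).__name__)
print(encoded)

# .decode() turns the bytes back into the original text.
decoded = encoded.decode("utf-8")
print(decoded)
```

Printing the bytes object shows the b'...' form with \xc3\xbc escapes, while printing the str shows the umlauts as written.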

On most machines you can simply open the file and read it. If the file's encoding is not the system default, you can pass it as a keyword argument:

with open(filename, encoding='utf8') as f:
    line = f.readline()
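Applied to the clippings file from the question, this could look roughly like the sketch below. It assumes the file is UTF-8 with a leading byte order mark (the question already uses 'utf-8-sig'); the temporary file and the sample line stand in for the real clippings file and are not part of the original code.

```python
import os
import tempfile

sample = "Die virtuelle Katastrophe: So führen Sie Teams über Distanz zur Spitzenleistung\n"

# Create a throwaway file that starts with a UTF-8 byte order mark,
# standing in for the real clippings file.
with tempfile.NamedTemporaryFile("w", encoding="utf-8-sig", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# 'utf-8-sig' strips the byte order mark while decoding; each line is a str.
with open(path, encoding="utf-8-sig") as f:
    titles = [line.strip() for line in f]

os.remove(path)

# No .encode() call: the title stays text and prints with its umlauts.
print(titles[0])
```

In the loop from the question, that corresponds to assigning title = line directly instead of title = line.encode('utf8').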
