Question

I wrote the following Python 3 script:

from sys import argv
from os.path import exists

script, from_file, to_file = argv

print(f"Copying from {from_file} to {to_file}")

in_file = open(from_file)
indata = in_file.read()

print(f"The input file is {len(indata)} bytes long")

print(f"Does the output file exist? {exists(to_file)}")
print("Ready, hit RETURN to continue, CTRL-C to abort.")
input()

out_file = open(to_file, 'w')
out_file.write(indata)

print("Alright, all done.")

out_file.close()
in_file.close()

Apparently the output of len(indata) should be:

The input file is 21 bytes long

But I get:

The input file is 46 bytes long

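from_file is a file named test.txt that contains the text "This is a test file."

I double-checked the text in test.txt. I think the difference may come from the computer, since I am on Windows and the teacher is not.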

Expected output of the exercise according to Zed

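This is my first post here, and I did try to find something about this problem. I found a few questions about exercise 17, but nothing about the byte difference.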

Answer 1

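Short version

You get that output because the file is encoded as UTF-16, probably because the editor you used to save it behaves that way on Windows, and you did not specify an encoding when reading it, so Python guessed wrong. To avoid this kind of problem, you should always pass an encoding argument to the open function, whether you are reading or writing: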

in_file = open(from_file, encoding='utf-16')
# ...
out_file = open(to_file, 'w', encoding='utf-16')
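As a minimal sketch of that advice, the copy step could be wrapped in a function that passes the same encoding to both open calls. The name copy_text, the temporary file names, and the default of 'utf-16' are assumptions made for illustration; they are not part of the exercise.

```python
import os
import tempfile


def copy_text(from_file, to_file, encoding="utf-16"):
    """Copy a text file, decoding and re-encoding with an explicit encoding.

    Hypothetical helper: returns the number of characters (code points) read.
    """
    with open(from_file, encoding=encoding) as in_file:
        indata = in_file.read()
    with open(to_file, "w", encoding=encoding) as out_file:
        out_file.write(indata)
    return len(indata)


# Build a small UTF-16 test file in a temporary directory, then copy it.
text = "This is a test file.\n"
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test.txt")
    dst = os.path.join(tmp, "copy.txt")
    with open(src, "w", encoding="utf-16") as f:
        f.write(text)
    count = copy_text(src, dst)
    with open(dst, encoding="utf-16") as f:
        copied = f.read()

print(count)  # 21
```

Using with blocks also closes both files even if an error occurs partway through.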

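Long version

21 is the number of bytes in the file when it is encoded as UTF-8 with a terminating LF character ('\n') and no byte order mark (BOM).

46 is the number of bytes in the file when it is encoded as UTF-16 with a terminating CR + LF combination ('\r\n') and a BOM.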
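Both counts can be checked directly by encoding the string. This is only an illustrative sketch; the variable names are made up, and it relies on the 'utf-16' codec prepending a two-byte BOM when encoding.

```python
text = "This is a test file."

# 20 characters + LF, one byte each in UTF-8, no BOM
utf8_lf = (text + "\n").encode("utf-8")

# 2-byte BOM + (20 characters + CR + LF) * 2 bytes each in UTF-16
utf16_crlf = (text + "\r\n").encode("utf-16")

print(len(utf8_lf))     # 21
print(len(utf16_crlf))  # 46
```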

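As much as we like to think that text is just "text", it has to be encoded as bytes somehow (see this Q&A for more information). On Linux, the most widely followed convention is to use UTF-8 for everything. On Windows, UTF-16 is more common, but you may run into other encodings as well.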

Python's open function has an encoding parameter that you can use to tell Python that the file you are opening is UTF-16, and then you get a different result:

in_file = open(from_file, encoding='utf-16')

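What was it doing before? Well, the open function is documented to use locale.getpreferredencoding(False) if you don't specify an encoding, and you can find out what that is on your machine by running import locale; locale.getpreferredencoding(False). I can save you the effort, though: on Windows the preferred encoding is typically Windows-1252 (for Western European and US locales). If you take the string "This is a test file.", encode it as UTF-16, and decode it as Windows-1252, you will see the odd string that Python actually produced: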

>>> line = "This is a test file."
>>> line_bytes = line.encode('utf-16')
>>> line_bytes.decode('windows-1252')
'ÿþT\x00h\x00i\x00s\x00 \x00i\x00s\x00 \x00a\x00 \x00t\x00e\x00s\x00t\x00 \x00f\x00i\x00l\x00e\x00.\x00'

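ÿþ is how Windows-1252 renders the BOM. The count still looks off, because len(line_bytes) is only 42 rather than 46; the missing four come from the line ending. If you add \r\n to the original string, you get a 46-character string.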
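The whole effect can be reproduced on any platform by writing the UTF-16 bytes to a file and reading them back in text mode with each encoding. This is a sketch under assumptions: the temporary file stands in for test.txt, little-endian byte order is written explicitly so the BOM is the same everywhere, and 'windows-1252' is passed by hand to imitate the Windows default.

```python
import codecs
import os
import tempfile

text = "This is a test file.\r\n"
# BOM (2 bytes) + 22 characters * 2 bytes = 46 bytes on disk
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "wb") as f:
        f.write(raw)
    # Wrong guess: every byte becomes one character
    with open(path, encoding="windows-1252") as f:
        misread = f.read()
    # Correct encoding: the BOM is consumed and '\r\n' becomes '\n'
    with open(path, encoding="utf-16") as f:
        decoded = f.read()

print(len(raw))      # 46
print(len(misread))  # 46
print(len(decoded))  # 21
```

In the misread case the '\r' and '\n' are separated by a '\x00' character, so text mode does not merge them into a single newline and the length stays at 46.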

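Note that even on Linux, Zed's output is misleading: the input file is 21 Unicode code points long, not 21 bytes. It just happens to also be 21 bytes, because all the characters in the file are in the ASCII subset of UTF-8 (the preferred encoding on Linux), which encodes each of them as a single byte.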
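To make the difference between code points and bytes concrete, here is a small sketch comparing an ASCII-only string with one that contains a non-ASCII character. The string "café" is just an example chosen for illustration and is not from the exercise.

```python
ascii_text = "This is a test file.\n"
accented = "caf\u00e9\n"  # "café" plus a newline; é is a single code point

# len() on a str counts code points; len() on bytes counts bytes.
print(len(ascii_text), len(ascii_text.encode("utf-8")))  # 21 21
print(len(accented), len(accented.encode("utf-8")))      # 5 6
```

To get a true byte count for a file, open it in binary mode ('rb') or use os.path.getsize instead of calling len() on decoded text.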

LPTHW exercise 17: why is the output of len() not what the exercise says?
