When I open a file with codecs.open('f.txt', 'r', encoding=None), Python 2.7.8 picks some default encoding. What is it, and where is it documented?
Some experimentation has shown that the default encoding is not utf-8, ascii, sys.getdefaultencoding(), locale.getpreferredencoding(), or locale.getpreferredencoding(False).
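For reference, a quick sketch of how those candidates can be inspected (the printed values are platform dependent):

import sys
import locale

# Candidate "defaults" that the experimentation above ruled out --
# printing them makes the comparison easy to reproduce.
print sys.getdefaultencoding()            # usually 'ascii' on Python 2
print locale.getpreferredencoding()       # platform dependent, e.g. 'cp1252' or 'UTF-8'
print locale.getpreferredencoding(False)  # same lookup without calling setlocale()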
Edit (to clarify my motivation): I want to know which encoding Python 2.7.8 picks when I run a script like this:
f = codecs.open('f.txt', 'r', encoding=None)  # or equivalently: f = open('f.txt')
for line in f:
    print len(line)  # obviously SOME encoding has been chosen if I can print the number of characters
I am not interested in other ways of guessing a file's encoding.
Answer 0: (score: 3)
It basically does not do any transparent encoding/decoding at all; it just opens the file and returns it.
Here is the code from the library:
def open(filename, mode='rb', encoding=None, errors='strict', buffering=1):
    """ Open an encoded file using the given mode and return
        a wrapped version providing transparent encoding/decoding.

        Note: The wrapped version will only accept the object format
        defined by the codecs, i.e. Unicode objects for most builtin
        codecs. Output is also codec dependent and will usually be
        Unicode as well.

        Files are always opened in binary mode, even if no binary mode
        was specified. This is done to avoid data loss due to encodings
        using 8-bit values. The default file mode is 'rb' meaning to
        open the file in binary read mode.

        encoding specifies the encoding which is to be used for the
        file.

        errors may be given to define the error handling. It defaults
        to 'strict' which causes ValueErrors to be raised in case an
        encoding error occurs.

        buffering has the same meaning as for the builtin open() API.
        It defaults to line buffered.

        The returned wrapped file object provides an extra attribute
        .encoding which allows querying the used encoding. This
        attribute is only available if an encoding was specified as
        parameter.

    """
    if encoding is not None:
        if 'U' in mode:
            # No automatic conversion of '\n' is done on reading and writing
            mode = mode.strip().replace('U', '')
            if mode[:1] not in set('rwa'):
                mode = 'r' + mode
        if 'b' not in mode:
            # Force opening of the file in binary mode
            mode = mode + 'b'
    file = __builtin__.open(filename, mode, buffering)
    if encoding is None:
        return file
    info = lookup(encoding)
    srw = StreamReaderWriter(file, info.streamreader, info.streamwriter, errors)
    # Add attributes to simplify introspection
    srw.encoding = encoding
    return srw

As you can see, if encoding is None, it simply returns the opened file.
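A quick way to confirm this (a minimal sketch; it only assumes the f.txt from the question exists):

import codecs

# With encoding=None, codecs.open hands back the plain builtin file object,
# so reads return byte strings (str) with no decoding applied.
f = codecs.open('f.txt', 'r', encoding=None)
print type(f)          # <type 'file'>
print type(f.read(1))  # <type 'str'> -- raw bytes, not unicode
f.close()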
Here is your file, with each byte shown as its decimal value alongside the corresponding ASCII character (a sketch of how such a dump can be produced follows the listing):
46 .
46 .
46 .
32 'space'
48 0
45 -
49 1
10 'line feed'
10 'line feed'
91 [
69 E
118 v
101 e
110 n
116 t
32 'space'
34 "
72 H
97 a
114 r
118 v
97 a
114 r
100 d
32 'space'
67 C
117 u
112 p
32 'space'
51 3
48 0
180 'this is not ascii'
34 "
93 ]
10 'line feed'
46 .
46 .
46 .
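The dump above can be produced with a few lines of Python (a rough sketch, not necessarily how it was originally generated):

# Read the file as raw bytes and print each byte's decimal value next to
# a printable representation of the character.
with open('f.txt', 'rb') as f:
    for ch in f.read():
        print ord(ch), repr(ch)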
The problem you are having opening it as ascii is the byte with decimal value 180: ASCII only goes up to 127. That got me thinking this must be some sort of extended ASCII, where 128-255 are used for extra symbols. After reading the Wikipedia article on ASCII (https://en.wikipedia.org/wiki/ASCII), I found that it mentions a popular extension of ASCII called windows-1252. In windows-1252, the decimal value 180 maps to the acute accent character (´). I then googled the string in your file to see what it actually refers to, and that is how I found "Harvard Cup 30´": http://www.365chess.com/tournaments/Harvard_Cup_30%C2%B4_1989/21650
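A small check of that mapping (byte 180 is 0xb4):

# Byte 180 (0xb4) is outside ASCII's 0-127 range but is the acute accent
# in windows-1252.
b = '\xb4'
print repr(b.decode('windows-1252'))   # u'\xb4' -- the acute accent
try:
    b.decode('ascii')
except UnicodeDecodeError as e:
    print e                            # ... ordinal not in range(128)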
So in summary, the correct encoding is probably windows-1252. Here is my test program:

import codecs
with codecs.open('f.txt', 'r', encoding='windows-1252') as f:
    print f.read()

Output:
Answer 1: (score: 1)
Using codecs.open('f.txt', 'r', encoding=None) returns byte strings rather than Unicode strings when reading the file. It makes no attempt to decode the file data with any encoding at all; it is equivalent to open('f.txt', 'r'). The lengths you are getting are the number of individual bytes in each line as stored in the file, with no translation.
A small example:
>>> import codecs
>>> codecs.open('f.txt','r',encoding=None).read()
'abc\n'
>>> codecs.open('f.txt','r',encoding='ascii').read() # Note Unicode string returned.
u'abc\r\n'
>>> open('f.txt','r').read()
'abc\n'
Note the '\r\n' in the second result: when an encoding is given, codecs.open forces binary mode (see the docstring quoted in the other answer), so no newline translation takes place, while the plain text-mode open() on Windows translates '\r\n' to '\n'.
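To tie this back to the question's loop: with encoding=None, len(line) counts bytes; only with an explicit encoding does it count decoded characters. A minimal sketch (windows-1252 is simply the encoding suggested in the other answer):

import codecs

# encoding=None: the plain file object is returned, lines are byte strings,
# so len() is a byte count.
for line in codecs.open('f.txt', 'r', encoding=None):
    print len(line)

# Explicit encoding: lines are unicode objects, so len() is a character count.
for line in codecs.open('f.txt', 'r', encoding='windows-1252'):
    print len(line)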