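I am reading a bunch of txt.gz files, but they have different encodings (at least UTF-8 and cp1252; they are old, dirty files). I try to detect the encoding of fIn before reading it in text mode, but I get the error: TypeError: 'GzipFile' object is not callable
The corresponding code: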
# detect encoding
with gzip.open(fIn, 'rb') as file:
    fInEncoding = tokenize.detect_encoding(file)  # this doesn't work
print(fInEncoding)
for line in gzip.open(fIn, 'rt', encoding=fInEncoding[0], errors="surrogateescape"):
    if line.find("From ") == 0:
        if lineNum != 0:
            out.write("\n")
        lineNum += 1
        line = line.replace(" at ", "@")
    out.write(line)
Traceback:
$ ./mailmanToMBox.py list-cryptography.metzdowd.com
('Converting ', '2015-May.txt.gz', ' to mbox format')
Traceback (most recent call last):
File "./mailmanToMBox.py", line 65, in <module>
main()
File "./mailmanToMBox.py", line 27, in main
if not makeMBox(inFile,outFile):
File "./mailmanToMBox.py", line 48, in makeMBox
fInEncoding = tokenize.detect_encoding(file.readline()) #this doesn't works
File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 423, in detect_encoding
first = read_or_stop()
File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 381, in read_or_stop
return readline()
TypeError: 'bytes' object is not callable
EDIT: I tried the following code:
# detect encoding
readsource = gzip.open(fIn,'rb').__next__
fInEncoding = tokenize.detect_encoding(readsource)
print(fInEncoding)
I get no error, but it always returns utf-8, even when the file is not UTF-8. My text editor (Sublime) correctly detects the cp1252 encoding.
Answer 0 (score: 2)
As the documentation of detect_encoding() says, its input argument must be a callable that provides lines of input. That is why you get TypeError: 'GzipFile' object is not callable.
import tokenize

with open(fIn, 'rb') as f:
    codec = tokenize.detect_encoding(f.readline)[0]
codec will then be "utf-8" or something similar.
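One caveat: tokenize.detect_encoding() is designed for Python source files. It looks for a BOM or a coding cookie in the first two lines and otherwise falls back to utf-8, which would explain the result reported in the EDIT above. A rough alternative is to try decoding the decompressed bytes with each candidate encoding in turn. The sketch below assumes only UTF-8 and cp1252 occur, and uses gzip.open because the files are compressed. The helper name guess_encoding and its candidates parameter are hypothetical, not part of any library.

```python
import gzip


def guess_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the whole file.

    Hypothetical helper: trial decoding is a heuristic, not real detection.
    """
    # Read the decompressed bytes; gzip.open in 'rb' mode yields raw bytes.
    with gzip.open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    # None of the candidates could decode the data.
    return None
```

The order of the candidates matters: UTF-8 is strict, so it should be tried first, while cp1252 accepts almost any byte sequence and works better as the fallback. The returned name could then be passed as the encoding argument to gzip.open(fIn, 'rt', ...).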