In Python 2.x, I used
"shift-jis".decode('shift-jis').encode('utf-8')
But Python 3.x no longer has str.decode(). What is the equivalent code in Python 3.x?
Update:
To be more specific, the Python 2 function is:
def unzip(file, dir):
    zips = zipfile.ZipFile(file)
    for info in zips.infolist():
        info.filename = info.filename.decode('shift-jis').encode('utf-8')
        zips.extract(info, dir)
        print(info.filename)
What is the equivalent Python 3 code for this function?
Answer 0 (score: 4)
For your updated question:
def unzip(file, directory):  # avoid the name `dir`, which shadows a built-in function
    with zipfile.ZipFile(file, mode='r') as zips:
        zips.printdir()
        zips.extractall(directory)
>>> b'\x82\xb3'.decode('shiftjis')
'さ'
>>> b'\x82\xb3'.decode('shift-jis')
'さ'
>>> b'\x82\xb3'.decode('shift_jis')
'さ'
>>> '日本語'.encode('shiftjis')
b'\x93\xfa\x96{\x8c\xea'
>>> b'\x93\xfa\x96{\x8c\xea'.decode('shiftjis')
'日本語'
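The Python 2 one-liner from the question maps onto bytes objects in Python 3: `bytes.decode()` produces a `str`, and `str.encode()` produces `bytes` again. A minimal sketch, where `sjis_to_utf8` is a hypothetical helper name that is not part of the original question or answer:

```python
def sjis_to_utf8(data: bytes) -> bytes:
    """Transcode Shift-JIS bytes to UTF-8 bytes (hypothetical helper)."""
    text = data.decode("shift_jis")  # bytes -> str (Unicode text)
    return text.encode("utf-8")      # str -> bytes in the target encoding


sjis_bytes = "日本語".encode("shift_jis")
utf8_bytes = sjis_to_utf8(sjis_bytes)
print(utf8_bytes)  # prints the bytes repr, which is safe on any console
```

In Python 3 there is usually no need for the final `encode('utf-8')` step unless bytes are explicitly required, since `str` is already Unicode.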
And when reading a file:
with open('shiftjis.txt', 'r', encoding='shiftjis') as file:
    text = file.read()  # do something with it
Read more: http://docs.python.org/3.3/library/io.html#i-o-base-classes
A less sensible version:
with open('shiftjis.txt', 'rb') as file:
    string = file.read().decode('shift-jis')
Answer 1 (score: 0)
I needed to do this myself, and the naive approach is:
import sys
import zipfile

def unzip(file, dir):
    zips = zipfile.ZipFile(file)
    for info in zips.infolist():
        # Undo the cp437 decoding of the name, then decode the raw bytes as Shift-JIS.
        info.filename = info.filename.encode("cp437").decode("shift-jis")
        print("Extracting: " + info.filename.encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding))
        zips.extract(info, dir)
    print("")
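ZipFile appears to treat all filenames internally as DOS (code page 437) text. Unlike Python 2, Python 3 stores all strings as Unicode text rather than raw bytes. So we encode the filename back into bytes with cp437, then decode those raw bytes as Shift-JIS to get the final filename.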
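The cp437 round trip described above can be checked on a small in-memory archive. This is only a sketch: `make_sjis_zip` and `fix_name` are hypothetical helpers, and the archive is built by writing an ASCII placeholder name and then patching the raw bytes, which is an assumption made purely to get a Shift-JIS name into a zip without a real Japanese archive at hand.

```python
import io
import zipfile


def make_sjis_zip(name: str) -> bytes:
    """Build a tiny zip whose member name is stored as raw Shift-JIS bytes.

    Hypothetical test helper: writes an ASCII placeholder of the same byte
    length, then swaps in the Shift-JIS bytes in the finished archive.
    """
    raw = name.encode("shift_jis")
    placeholder = b"X" * len(raw)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(placeholder.decode("ascii"), b"hello")
    # The name appears in the local header and the central directory.
    return buf.getvalue().replace(placeholder, raw)


def fix_name(name: str) -> str:
    """Recover a Shift-JIS name that was decoded as cp437."""
    return name.encode("cp437").decode("shift_jis")


data = make_sjis_zip("日本語.txt")
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    garbled = zf.namelist()[0]  # name as zipfile reports it

restored = fix_name(garbled)
print(ascii(garbled), ascii(restored))  # ascii() keeps the output console-safe
```

The round trip works because cp437 assigns a character to every byte value, so encoding back to cp437 returns exactly the bytes stored in the archive.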
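The print line does something similar: it round-trips the name through stdout's default encoding, replacing any characters that encoding cannot represent. This prevents an error that occurs on Windows, because its terminal has very limited Unicode support. (If the terminal does support the characters, the name should display correctly.)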
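The console round trip can be pulled out into a small function. This is a sketch under assumptions: `console_safe` is a hypothetical name, and the fallback to UTF-8 when `sys.stdout.encoding` is `None` is an addition that the original code does not have.

```python
import sys


def console_safe(text: str, encoding: str) -> str:
    """Replace characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# Assumption: fall back to UTF-8 if stdout has no declared encoding.
stdout_encoding = sys.stdout.encoding or "utf-8"
print("Extracting: " + console_safe("report.txt", stdout_encoding))

# With an ASCII-only console, each Japanese character becomes '?'.
print(console_safe("日本語.txt", "ascii"))  # ???.txt
```

The filename used for extraction is untouched; only the copy that is printed gets degraded.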
This worked for several zip files, until bam...
Traceback (most recent call last):
  File "jp\j-unzip.py", line 73, in <module>
    unzip(archname,archpath)
  File "jp\j-unzip.py", line 68, in unzip
    info.filename = info.filename.encode("cp437").decode("shift-jis")
UnicodeDecodeError: 'shift_jis' codec can't decode byte 0x8f in position 28: illegal multibyte sequence
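Bonus content! It took a while to figure out, but the problem is that some valid Shift-JIS characters contain a backslash byte, which ZipFile converts to a forward slash! For example, 十 is encoded as 8F 5C in Shift-JIS. This gets converted to 8F 2F, which is an illegal sequence. The following (possibly overcomplicated) code checks for this condition when an error occurs and tries to fix it. But there may be other characters where this happens and the resulting sequence is still valid, so you get a wrong character instead of an error. :(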
import sys
import zipfile

def convert_filename(inname):
    err_ctr = 0
    keep_going = True
    # Undo the cp437 decoding; a bytearray lets us patch individual bytes.
    trans_filename = bytearray(inname.encode("cp437"))
    while keep_going:
        keep_going = False
        try:
            outname = trans_filename.decode("shift-jis")
        except UnicodeDecodeError as e:
            keep_going = True
            if e.reason == "illegal multibyte sequence":
                # e.start is the offending lead byte; for this error e.end indexes the byte after it.
                p0, p1 = e.start, e.end
                print("Trying to fix encoding error at positions " + str(p0) + ", " + str(p1) + " caused by shift-jis sequence " + hex(trans_filename[p0]) + ", " + hex(trans_filename[p1]))
                if trans_filename[p0] > 127 and trans_filename[p1] == 0x2f:
                    # A forward slash right after a lead byte was originally a backslash.
                    trans_filename[p1] = 0x5c
                else:
                    print("Don't know how to fix this error. Quitting. :(")
                    raise
                err_ctr = err_ctr + 1
                print("This is error #" + str(err_ctr) + " for this filename.")
            else:
                raise
        if err_ctr > 50:
            print("More than 50 iterations. Are we stuck in an endless loop? Quitting...")
            sys.exit(1)
    return outname
def unzip(file, dir):
    zips = zipfile.ZipFile(file)
    for info in zips.infolist():
        info.filename = convert_filename(info.filename)
        print("Extracting: " + info.filename.encode(sys.stdout.encoding, errors='replace').decode(sys.stdout.encoding))
        zips.extract(info, dir)
    print("")
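The backslash problem described in this answer can be reproduced without any zip file. The sketch below is not the answer's code: it simulates the slash conversion with a plain byte replacement, then repairs the trail byte by hand using the position reported in the exception.

```python
good = "十".encode("shift_jis")      # two bytes: 0x8F 0x5C
broken = good.replace(b"\\", b"/")   # simulate backslash -> forward slash

err = None
try:
    broken.decode("shift_jis")
except UnicodeDecodeError as exc:
    err = exc  # keep a reference; `exc` is cleared when the block ends
    print(exc.reason, "at byte", exc.start)

# Repair: the byte after the reported lead byte goes back to a backslash.
patched = bytearray(broken)
if patched[err.start + 1] == 0x2F:
    patched[err.start + 1] = 0x5C
repaired = patched.decode("shift_jis")
print(repaired == "十")
```

Because 0x2F is outside the range of valid Shift-JIS trail bytes, a slash directly after a lead byte always raises an error, and that is the signal the repair relies on.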