Question

I'm using mutagen to read mojibaked ID3 tags. My goal is to fix the mojibake while learning about encodings and how Python handles them.

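The file I'm working with has an ID3v2 tag, and I'm looking at its album (TALB) frame, which, according to the encoding byte in the TALB ID3 frame, is encoded in Latin-1 (ISO-8859-1). However, I know that the bytes in this frame are actually encoded in cp1251 (Cyrillic).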

Here is my code so far:

>>> from mutagen.mp3 import MP3
>>> mp3 = MP3(paths[0])
>>> mp3['TALB']
TALB(encoding=0, text=[u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'])

Now, as you can see, mp3['TALB'].text[0] is represented here as a Unicode string. However, it's mojibaked:

>>> print mp3['TALB'].text[0]
Áóðæóéñêèå ïëÿñêè

I'm having very little luck transcoding these cp1251 bytes into their correct Unicode code points. My best result so far is rather clumsy:

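>>> st = ''.join([chr(ord(x)) for x in mp3['TALB'].text[0]]); st
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Буржуйские пляски <-- **this is the correct, demojibaked text!**

As I understand this approach, it works because I end up converting the Unicode string into an 8-bit string, which I can then decode into Unicode while specifying the encoding I'm decoding from.

The problem is that I can't call decode('cp1251') on the Unicode string directly: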

>>> st = mp3['TALB'].text[0]; st
u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/dmitry/dev/mp3_tag_encode_convert/lib/python2.7/encodings/cp1251.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

Could someone explain this? When operating directly on the u'' string, I can't figure out how to keep it from going through the 7-bit ASCII range.

Answer 1

First, encode it with the encoding it was decoded with (here, Latin-1) to get the original bytes back.

>>> tag = u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> raw = tag.encode('latin-1'); raw
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
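As a rough illustration of why this step works: Latin-1 maps code points U+0000 through U+00FF straight onto bytes 0x00 through 0xFF, so encoding with it should give the same bytes as the chr(ord(x)) join from the question. The sketch below assumes Python 3 (where str is Unicode text and bytes is the 8-bit type) rather than the Python 2.7 used in the session above.

```python
# Python 3 sketch (assumption: the original session is Python 2.7).
tag = '\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'

# The manual approach from the question: one byte per code point.
manual = bytes(ord(ch) for ch in tag)

# The codec-based approach: Latin-1 maps each code point below 256
# to the byte with the same value.
raw = tag.encode('latin-1')

# Both routes recover the same 8-bit data.
assert manual == raw
print(raw)
```

Letting the codec do the work also means any character above U+00FF raises an error instead of being silently mangled.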

Then you can decode it with the correct encoding.

>>> fixed = raw.decode('cp1251'); print fixed
Буржуйские пляски
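As for the error in the question: in Python 2, calling decode() on a unicode object first encodes it implicitly to a byte string with the default ASCII codec, and that hidden step is what raises UnicodeEncodeError. The two steps above can be wrapped up as a small helper. This is only a sketch, assuming Python 3; the name fix_mojibake and its default arguments are made up for illustration and are not part of mutagen.

```python
# Python 3 sketch. fix_mojibake is a hypothetical helper, not a mutagen API.
def fix_mojibake(text, wrong='latin-1', right='cp1251'):
    """Repair text whose bytes were decoded as `wrong` but were really `right`."""
    # Step 1: undo the wrong decode to get the original bytes back.
    raw = text.encode(wrong)
    # Step 2: decode those bytes with the encoding they were written in.
    return raw.decode(right)


tag = '\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
fixed = fix_mojibake(tag)
print(fixed)  # Буржуйские пляски

# In Python 3 the confusing call from the question is not possible at all:
# str has no decode() method, so there is no implicit ASCII encode to trip over.
assert not hasattr(tag, 'decode')
```

Passing the encodings as parameters keeps the helper usable for other mislabeled tags, provided you know which encoding the bytes were really written in.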
