Python: same characters, different behavior

Asked: 2015-06-25 18:09:25

Tags: python string unicode decode encode

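I am using Python 2.7.9 to build file names from a list fetched from a Postgres database. Some of the words in this list contain special characters. Normally I use ''.join() to assemble the name and pass it to my loader, but exactly one name is not recognized. The .py file is set to utf-8 encoding, but the words are Portuguese, which I think means they are latin-1 encoded.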

from pydub import AudioSegment
from pydub.playback import play
templist = ['+ Orégano','- Búfala','+ Rúcola']
count_ins = (len(templist)-1)
while (count_ins >= 0 ):
    kot_istructions = AudioSegment.from_ogg('/home/effe/voice_orders/Voz/'+"".join(templist[count_ins])+'.ogg')
    count_ins-=1
    play(kot_istructions)

The first two files load:

/home/effe/voice_orders/Voz/+ Orégano.ogg

/home/effe/voice_orders/Voz/- Búfala.ogg

The third should be:

/home/effe/voice_orders/Voz/+ Rúcola.ogg

But Python is trying to load

/home/effe/voice_orders/Voz/+ R\xc3\xbacola.ogg

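Why only this one? I tried using normalize() to strip the accents, but since this is a str, that method does not work. Printing works fine, and so does the db update. Only the file name creation does not work as expected. Any suggestions?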
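A note on the normalize() attempt, assuming it refers to unicodedata.normalize: in Python 2 that function accepts only unicode objects, so a byte str has to be decoded first. The following is a minimal sketch under that assumption; the helper name strip_accents and the sample bytes are illustrative and not part of the original code.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks from a text (unicode) string.

    Hypothetical helper for illustration only.
    """
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Keep every character except the combining marks.
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


# UTF-8 bytes of the name that appears in the failing path (sample data).
raw = b'+ R\xc3\xbacola'

# Decode to text first, then normalize; passing raw bytes would fail.
plain = strip_accents(raw.decode('utf-8'))
print(repr(plain))
```

Stripping accents changes the name, so this only helps if the files on disk are also stored without accents.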

2 answers:

Answer 0: (score: 1)

It seems the root cause might be that the encoding of these names is inconsistent within your database. If you run:

>>> 'R\xc3\xbacola'.decode('utf-8')

you get

u'R\xfacola'

which is in fact a Python unicode object that correctly represents the name.

So, what should you do? Although it's a really unclean programming style, you could play .encode()/.decode() whack-a-mole: try to decode the raw string from your db as utf-8 and, failing that, as latin-1. It would look something like this:

try:
    clean_unicode = dirty_string.decode('utf-8')
except UnicodeDecodeError:
    clean_unicode = dirty_string.decode('latin-1')

As a general rule, always work with clean unicode objects within your own source, and only convert to an encoding when saving data out. Also, don't let people insert data into a database without specifying the encoding; enforcing that will stop you from having this problem in the first place.

Hope that helps!
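A minimal sketch of the fallback and the "decode early, encode late" rule from this answer, written so that it runs on Python 2.7 and Python 3 alike. The helper name to_unicode and the sample byte strings are illustrative assumptions; only the directory comes from the question. Latin-1 decoding never raises, so it acts as a last resort and can silently produce wrong characters for data in some third encoding.

```python
def to_unicode(raw):
    """Return text for a value that may hold UTF-8 or Latin-1 bytes.

    Hypothetical helper: tries UTF-8 first and falls back to Latin-1.
    """
    if not isinstance(raw, bytes):
        # Already a text (unicode) object: nothing to decode.
        return raw
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode('latin-1')


# The same name stored in two different encodings (sample data).
utf8_name = b'+ R\xc3\xbacola'    # UTF-8 bytes
latin1_name = b'+ R\xfacola'      # Latin-1 bytes

clean_a = to_unicode(utf8_name)
clean_b = to_unicode(latin1_name)

# Build the path entirely from text objects...
path = u'/home/effe/voice_orders/Voz/' + clean_a + u'.ogg'
# ...and encode only at the boundary, if the consumer needs bytes.
path_bytes = path.encode('utf-8')
print(repr(path))
```

Both inputs end up as the same unicode value, so every file name built from them is identical regardless of how the row was stored.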

Answer 1: (score: 0)

Solved: the problem was with the file itself. Deleting it and creating it again did the job.