Python: converting gibberish to Hebrew

Date: 2013-08-17 22:01:15

Tags: python unicode character-encoding hebrew

Here is my code:

# -*- coding: utf-8-*-
array=["à","á","â","ã","ä","å","æ","ç","è","é","ê","ë","ì","í","î","ï","ð","ñ","ó","ô","õ","ö","ø","ù","ú","û","ü","ý","þ","ÿ"]
array1=["א","ב","ג","ד","ה","ו","ז","ח","ט","י","ך","כ","ל","ם","מ","ן","נ","ס","ע","ף","פ","ץ","צ","ק","ר","ש","ת"]
str="áï éäåãä"
message=""
for i in range(0,len(str)):
   s=str[i]
   index=-1
   for j in range(0,len(array)):
       if(array[j]==s):
           index=j
           break
   if(index!=-1):
       message+=array1[index]
       print array1[index]
print message

The error is:

SyntaxError: EOL while scanning string literal
on line 2.

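I have a text file in Hebrew, but whatever encoding I choose, it always displays as gibberish. This is a Python program meant to convert it to Hebrew. The original file is in ISO-8859-1.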

2 answers:

Answer 0 (score: 4)

You used a ' where you should have used a ":

'ÿ"

That is the last item of the list:

array=["à","á","â","ã","ä","å","æ","ç","è","é","ê","ë","ì","í","î","ï","ð","ñ","ó","ô","õ","ö","ø","ù","ú","û","ü","ý","þ",'ÿ"]

Change that single quote to a double quote.

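As for your translation program: it sounds as if your file is either not encoded the way you think or is being decoded incorrectly. Perhaps you should work out the correct encoding instead of blindly replacing Latin-1 bytes with the UTF-8 sequences for Hebrew code points?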

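If you use the codecs module to open the file with the correct codec and decode it to Unicode, you will most likely find that the data was encoded correctly all along.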
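
A minimal sketch of that idea. It assumes the file is really cp1255 (the question only says it shows up as ISO-8859-1, so the codec is a guess), and it writes a hypothetical temporary file to stand in for the real one:

```python
# coding: utf8
import codecs
import os
import tempfile

# Stand-in for the real file; the path and contents are hypothetical.
# These bytes are the sample string from the question as stored on disk.
path = os.path.join(tempfile.mkdtemp(), 'hebrew.txt')
with open(path, 'wb') as f:
    f.write(b'\xe1\xef \xe9\xe4\xe5\xe3\xe4')

# Decoding with the wrong codec reproduces the gibberish.
with codecs.open(path, 'r', encoding='latin1') as f:
    wrong = f.read()

# Decoding with the assumed correct codec gives Hebrew text.
with codecs.open(path, 'r', encoding='cp1255') as f:
    right = f.read()

print(wrong)
print(right)
```

No lookup table is needed here: the bytes on disk never change, only the codec used to interpret them.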

I strongly urge you to read up on Unicode, codecs, and Python before you continue.

Answer 1 (score: 3)

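As @Martijn suggests, decoding the original file correctly would be a better solution. If your file is Hebrew but displays as the characters in array, it is probably being displayed as latin1 or cp1252. cp1255 looks like a close match. Maybe your array1 isn't quite right. Also note that strings are iterable, so you can simplify the arrays: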

# coding: utf8
array  = u'àáâãäåæçèéêëìíîïðñóôõöøùúûüýþÿ'
array1 = u'אבגדהוזחטיךכלםמןנסעףפץצקרשת'
print(array)
print(array1)
print(array.encode('cp1252').decode('cp1255',errors='replace'))

The last line above reverses the "wrong" encoding and decodes the result with cp1255 (a Hebrew encoding). Output:

àáâãäåæçèéêëìíîïðñóôõöøùúûüýþÿ
אבגדהוזחטיךכלםמןנסעףפץצקרשת
אבגדהוזחטיךכלםמןנסףפץצרשת��‎‏�

It's not a perfect match, but it's close enough that I think your original file was encoded in cp1255.
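
The same round trip can be tried on the sample string from the question as a quick check of that guess. This is only a sketch: it assumes the text was wrongly decoded as cp1252 (latin1 gives the same result for these characters) and that cp1255 is the right codec, and the helper name fix_mojibake is made up for illustration:

```python
# coding: utf8
def fix_mojibake(text, wrong='cp1252', right='cp1255'):
    """Re-encode text with the codec it was wrongly decoded with,
    then decode those bytes with the (assumed) correct codec."""
    return text.encode(wrong).decode(right, errors='replace')

sample = u'áï éäåãä'  # the string from the question
print(fix_mojibake(sample))
```

If the guess is right, this prints בן יהודה, which replaces the hand-written lookup tables entirely.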