我该如何处理重音字母,德语字母和其他字符?

时间:2010-10-20 19:51:20

标签: python unicode

我的python脚本现在正在运行,但我遇到了一些麻烦:

这是输出:

from BeautifulSoup import BeautifulSoup
import urllib

langCode={
    "arabic":"ar", "bulgarian":"bg", "chinese":"zh-CN",
    "croatian":"hr", "czech":"cs", "danish":"da", "dutch":"nl",
    "english":"en", "finnish":"fi", "french":"fr", "german":"de",
    "greek":"el", "hindi":"hi", "italian":"it", "japanese":"ja",
    "korean":"ko", "norwegian":"no", "polish":"pl", "portugese":"pt",
    "romanian":"ro", "russian":"ru", "spanish":"es", "swedish":"sv" }

def setUserAgent(userAgent):
    urllib.FancyURLopener.version = userAgent
    pass

def translate(text, fromLang, toLang):
    setUserAgent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008070400 SUSE/3.0.1-0.1 Firefox/3.0.1")
    try:
        postParameters = urllib.urlencode({"langpair":"%s|%s" %(langCode[fromLang.lower()],langCode[toLang.lower()]), "text":text,"ie":"UTF8", "oe":"UTF8"})
    except KeyError, error:
        print "Currently we do not support %s" %(error.args[0])
        return

    page = urllib.urlopen("http://translate.google.com/translate_t", postParameters)
    content = page.read()
    page.close()

    htmlSource = BeautifulSoup(content)
    translation = htmlSource.find('span', title=text )
    return translation.renderContents()


print translate("Good morning to you friend!", "English", "German")
print translate("Good morning to you friend!", "English", "Italian")
print translate("Good morning to you friend!", "English", "Spanish")

Guten Morgen, du Freund!
Buongiorno a te amico!
Buenos días a ti amigo!

如何管理不是基本英文字母的字母?你会怎么建议我解决这个问题?我在想一本字典用另一个字符替换某些链,但我确信Python已经有了这样的东西。包括电池和诸如此类的东西。 :P

感谢。

2 个答案:

答案 0 :(得分:1)

请勿解析http://translate.google.com/translate_t,因为Google会为此目的提供AJAX服务。 translatedText返回的json数据中的ajax.googleapis.com已经是unicode字符串。

import urllib2
import urllib
import sys
import json

LANG={
    "arabic":"ar", "bulgarian":"bg", "chinese":"zh-CN",
    "croatian":"hr", "czech":"cs", "danish":"da", "dutch":"nl",
    "english":"en", "finnish":"fi", "french":"fr", "german":"de",
    "greek":"el", "hindi":"hi", "italian":"it", "japanese":"ja",
    "korean":"ko", "norwegian":"no", "polish":"pl", "portugese":"pt",
    "romanian":"ro", "russian":"ru", "spanish":"es", "swedish":"sv" }

def translate(text,lang1,lang2):
    base_url='http://ajax.googleapis.com/ajax/services/language/translate?'    
    langpair='%s|%s'%(LANG.get(lang1.lower(),lang1),
                      LANG.get(lang2.lower(),lang2))
    params=urllib.urlencode( (('v',1.0),
                       ('q',text.encode('utf-8')),
                       ('langpair',langpair),) )
    url=base_url+params
    content=urllib2.urlopen(url).read()
    try: trans_dict=json.loads(content)
    except AttributeError:
        try: trans_dict=json.load(content)    
        except AttributeError: trans_dict=json.read(content)
    return trans_dict['responseData']['translatedText']

print translate("Good morning to you friend!", "English", "German")
print translate("Good morning to you friend!", "English", "Italian")
print translate("Good morning to you friend!", "English", "Spanish")

产量

Guten Morgen, du Freund!
Buongiorno a te amico!
Buenos días a ti amigo!

答案 1 :(得分:0)

urlopen()返回的标头中解析正确的字符集,并将其作为fromEncoding参数传递给BeautifulSoup构造函数。