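I'm trying to write a script (purely for learning purposes) that translates a given word using a few different dictionaries. I got two of them working, using urllib2 and BeautifulSoup to fetch and parse the translations, and then moved on to Google Translate.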
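I quickly found that it returns a 403 Forbidden error. Adding a user agent gets a translation, but only a single-word one. To illustrate, go to http://translate.google.com/?text=test&sl=en&tl=es and you get the translation (in a class titled 'hps') plus lists of verbs, nouns, and adjectives. With the script below, however, the HTML is different: only the main translation is returned, inside span id=result_box, and the verbs, nouns, etc. are nowhere to be found.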
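In the process, and after a fair amount of googling, I realised there is now an API, and not a free one. I don't intend to release any final script, nor to use it to violate any TOS, but what I'm most interested in now is why there is a difference between the browser and urllib etc.
To that end, I tried plain urllib2 with a user agent, and mechanize, as shown below. So my question is: apart from the user agent, what differs between a browser and a Python script? I tried using Firebug, but nothing jumped out at me (although I am a noob). Thanks!
Edit: the request headers from Firebug and from my script are below.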
import mechanize
import re
import cookielib
# Browser
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Want debugging messages?
br.set_debug_http(True)
br.set_debug_redirects(True)
br.set_debug_responses(True)
# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
# Open some site, let's pick a random one, the first that pops in mind:
r = br.open('http://translate.google.com/?text=test&sl=en&tl=es')
html = r.read()
match = re.findall(r'verb', html)
print match
Firebug:
GET /?text=test&sl=en&tl=es HTTP/1.1
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Accept-Encoding gzip, deflate
Accept-Language en-us,en;q=0.5
Connection keep-alive
Cookie PREF=ID=298b435815ef8553:U=e7dad4baf65f083b:FF=0:LD=en:CR=2:TM=1327516863:LM=1339428154:S=maktYFZEHXXpMDFg; NID=60=U229h4lzOnjpHyidbhgYecCx72Myp_-XHgupW-R_mWtpuOveDdIOO1uLBq-6ltn-ER15ppJryR7yYOYEhkCfUCl45qNz5aymBQ1CGDHS4UcHu2oIDYAHut0ctnlL76eDW3n7kjOWoz5wNH6NMw
Host translate.google.com
User-Agent Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0) Gecko/20100101 Firefox/9.0
Script:
send: 'GET /?text=test&sl=en&tl=es HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: translate.google.com\r\nConnection: close\r\nUser-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Mon, 11 Jun 2012 16:13:42 GMT
header: Expires: Fri, 01 Jan 1990 00:00:00 GMT
header: Cache-Control: no-cache, must-revalidate
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Type: text/html; charset=UTF-8
header: Content-Language: en
header: Set-Cookie: PREF=ID=6dd42f2264250d7c:TM=1333431222:LM=1339454222:S=k6JXSoGGaAMNmPEo; expires=Wed, 11-Jun-2014 16:13:42 GMT; path=/; domain=.google.com
header: Set-Cookie: NID=60=f8czmR413h3sKUGJUUM4PLKl2O7SUtqfW5hss5O54sRKoErf9wIEU4Wu2WCuHzWTJQ3p1Rj7dQv1B4BBmSMY1tmfus7UZGCYFIKaXoKwklZ9tZsr5vds8vvvFjRdZyevn; expires=Tue, 11-Dec-2012 16:13:42 GMT; path=/; domain=.google.com; HttpOnly
header: P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
header: X-Content-Type-Options: nosniff
header: Server: HTTP server (unknown)
header: X-XSS-Protection: 1; mode=block
header: Connection: close
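Answer 0 (score: 1)
The verbs and adjectives are not found because they are loaded through an AJAX call. Your mechanize browser has no JavaScript, so it cannot make any AJAX requests. However, if you look in your browser's inspector or a similar tool, you will see the headers, URL, and parameters of that call. All that is left to do is mimic the call.
I curl'ed it and got a JSON response: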
thrustmaster@thrustmaster:~/Temp$ curl 'http://translate.google.com/translate_a/t?client=t&text=test&hl=en&sl=en&tl=es&multires=1&ssel=0&tsel=0&sc=1' -H 'User-Agent: blah'
[[["prueba","test","",""]],[["noun",["prueba","ensayo","test","examen","an�lisis","criterio","toque","ejercicio","tanteo"],[["prueba",["test","proof","evidence","trial","event","race"]],["ensayo",["test","trial","essay","assay","testing","rehearsal"]],["test",["test"]],["examen",["examination","review","exam","test","inspection","quiz"]],["an�lisis",["analysis","test","review","assay","breakdown"]],["criterio",["criterion","judgment","standard","test","view","yardstick"]],["toque",["touch","stroke","test","knock","blast","chime"]],["ejercicio",["exercise","practice","drill","practicing","test","prosecution"]],["tanteo",["score","scoring","trial","test","try","calculation"]]]],["adjective",["de prueba"],[["de prueba",["test","testing","trial","probationary","corrective"]]]],["verb",["probar","comprobar","ensayar","examinar","poner a prueba","experimentar","someter a prueba","interrogar","hacer investigaciones","justificar","graduar"],[["probar",["test","try","prove","taste","try out","sample"]],["comprobar",["check","test","prove","ascertain","make sure","substantiate"]],["ensayar",["test","try","rehearse","try out","assay","essay"]],["examinar",["examine","consider","review","look at","explore","test"]],["poner a prueba",["test","try","try out","prove","tempt","put through his paces"]],["experimentar",["experience","experiment","undergo","experiment with","feel","test"]],["someter a prueba",["test","try out","touch"]],["interrogar",["question","interrogate","examine","cross-examine","ask","test"]],["hacer investigaciones",["test"]],["justificar",["justify","warrant","substantiate","prove","make good","test"]],["graduar",["graduate","grade","calibrate","time","test"]]]]],"en",,[["prueba",[5],1,0,1000,0,1,0]],[["test",4,,,""],["test",5,[["prueba",1000,1,0],["prueba de",0,1,0],["ensayo",0,1,0],["de prueba",0,1,0],["test",0,1,0]],[[0,4]],"test"]],,,[["en"]],5]thrustmaster@thrustmaster:~/Temp$
Now, in your script, you would probably have to fetch the response from the same URL used in the curl command above.
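As a rough illustration of mimicking that call from Python, here is a minimal sketch. It is written for Python 3 (`urllib.request`) rather than the question's Python 2 `urllib2`. The URL and query parameters are copied from the curl command above; the endpoint is undocumented, so it may have changed or may reject such requests. The helper names (`build_url`, `fetch`, `parse_response`, `summarize`) are made up for this example, and the response layout assumed by `summarize` is only inferred from the sample output above. Note that the response is not strict JSON: it contains empty slots such as `"en",,[[`, so the sketch fills them with `null` before parsing.

```python
import json
import re
import urllib.parse
import urllib.request

# Endpoint taken from the curl command above; undocumented, may no longer work.
BASE_URL = "http://translate.google.com/translate_a/t"


def build_url(text, source_lang, target_lang):
    """Build the request URL with the same parameters as the curl call."""
    params = [
        ("client", "t"), ("text", text), ("hl", "en"),
        ("sl", source_lang), ("tl", target_lang),
        ("multires", 1), ("ssel", 0), ("tsel", 0), ("sc", 1),
    ]
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch(url, user_agent="blah"):
    """Fetch the raw response body; a User-Agent header is sent as in the curl call."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        # Assumption: decode as UTF-8 and replace any bytes that do not fit.
        return response.read().decode("utf-8", errors="replace")


# Matches either a whole string literal (left untouched), a comma that is
# followed by another comma or a closing bracket, or an opening bracket that
# is followed by a comma. The last two mark an empty slot.
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|,(?=\s*[,\]])|\[(?=\s*,)')


def parse_response(raw):
    """Fill empty array slots with null, then parse the result as JSON."""
    def fill(match):
        token = match.group(0)
        if token.startswith('"'):
            return token
        return token + "null"
    return json.loads(_TOKEN.sub(fill, raw))


def summarize(data):
    """Return (main translation, {part of speech: [translations]}).

    Assumes the layout seen in the sample output: data[0][0][0] is the main
    translation and data[1] lists [part_of_speech, translations, ...] entries.
    """
    by_part_of_speech = {entry[0]: entry[1] for entry in data[1]}
    return data[0][0][0], by_part_of_speech
```

With network access, and assuming the endpoint still answers, usage would look like `main, by_pos = summarize(parse_response(fetch(build_url("test", "en", "es"))))`.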
PS:
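As you said, this may be a TOS issue if you plan to use this script. Using the API is always the better option. The HTML you depend on can change at any time.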