这段代码出了什么问题?
>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
print(line.decode("utf-8"))
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=windows-1251"><title>Google</title><script>window.google={kEI:"XMECT7XyDcGn0AWFk7ywAQ",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},https:function(){return window.location.protocol=="https:"},kEXPI:"33492,35300",kCSI:{e:"33492,35300",ei:"XMECT7XyDcGn0AWFk7ywAQ"},authuser:0,
ml:function(){},kHL:"uk",time:function(){return(new Date).getTime()},log:function(a,b,c,e){var d=new Image,g=google,h=g.lc,f=g.li,j="";d.onerror=(d.onload=(d.onabort=function(){delete h[f]}));h[f]=d;if(!c&&b.search("&ei=")==-1)j="&ei="+google.getEI(e);var i=c||"/gen_204?atyp=i&ct="+a+"&cad="+b+j+"&zx="+google.time(),k=/^http:/i;if(k.test(i)&&google.https()){google.ml(new Error("GLMM"),false,{src:i});
delete h[f];return}d.src=i;g.li=f+1},lc:[],li:0,Toolbelt:{},y:{},x:function(a,b){google.y[a.id]=
[a,b];return false}};
window.google.sn="webhp";window.google.timers={};window.google.startTick=function(a,b){window.google.timers[a]={t:{start:(new Date).getTime()},bfr:!(!b)}};window.google.tick=function(a,b,c){if(!window.google.timers[a])google.startTick(a);window.google.timers[a].t[b]=c||(new Date).getTime()};google.startTick("load",true);try{}catch(u){}
var _gjwl=location;function _gjuc(){var e=_gjwl.href.indexOf("#");if(e>=0){var a=_gjwl.href.substring(e);if(a.indexOf("&q=")>0||a.indexOf("#q=")>=0){a=a.substring(1);if(a.indexOf("#")==-1){for(var c=0;c<a.length;){var d=c;if(a.charAt(d)=="&")++d;var b=a.indexOf("&",d);if(b==-1)b=a.length;var f=a.substring(d,b);if(f.indexOf("fp=")==0){a=a.substring(0,c)+a.substring(b,a.length);b=c}else if(f=="cad=h")return 0;c=b}_gjwl.href="/search?"+a+"&cad=h";return 1}}}return 0}function _gjp(){!(window._gjwl.hash&&
window._gjuc())&&setTimeout(_gjp,500)};
Traceback (most recent call last):
File "<pyshell#109>", line 2, in <module>
print(line.decode("utf-8"))
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 2364: invalid continuation byte
答案 0 :(得分:6)
Google会向您发送Windows-1251编码文本,它会在元标记中显示。这将有效:
>>> from urllib.request import urlopen
>>> for line in urlopen("http://google.com/"):
print(line.decode("cp1251"))
答案 1 :(得分:2)
那是你的失败线(最后一部分):
>>> line
b'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Im\xe1genes</a>'
>>> line.decode()
Traceback (most recent call last):
File "<pyshell#12>", line 1, in <module>
line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 62: invalid continuation byte
失败的代码来自带有重音的西班牙语单词:
>>> bite = 0xe1
>>> bite
225
>>> chr(225)
'á'
你可以相应地解析拉丁语:
>>> line.decode('latin-1')
'<a class=gb1 href="http://www.google.es/imghp?hl=es&tab=wi">Imágenes</a>'
顺便说一下,Imágenes
是西班牙语images