Question

I'm using urllib2 to open a URL and print the HTML page, but when the page starts loading, a JavaScript challenge runs first, and the body contains this:

<body onload="challenge();">

The script loads first and then the real page loads, but when I print the response this way:

import urllib2

# site holds the target URL
response = urllib2.urlopen(site)
html = response.read()
print "Get all data: ", html

This is the output:

Get all data:  <html>
<body onload="challenge();">
<script>
eval(function(p,a,c,k,e,r){e=function(c){return c.toString(a)};if(!''.replace(/^/,String)){while(c--)r[e(c)]=k[c]||e(c);k=[function(e){return r[e]}];e=function(){return'\\w+'};c=1};while(c--)if(k[c])p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c]);return p}('1 6(){2.3=\'4=5; 0-7=8; 9=/\';a.b.c()}',13,13,'max|function|document|cookie|website|455b33285501836b3483c1554b8d8c51586bd800|challenge|age|1600|path|window|location|reload'.split('|'),0,{}))
</script>
</body>
</html>

The printed HTML contains only the JavaScript, not the final page. Is there a way to print the full page?

Answer 1

When you unpack the JavaScript (i.e., run the eval), you get:

function challenge()
{
    document.cookie='website=455b33285501836b3483c1554b8d8c51586bd800; max-age=1600; path=/';
    window.location.reload()
}
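If no browser is at hand, the unpacking can also be reproduced offline. The following is a minimal Python 3 sketch of the packer's substitution step, not the packer's official implementation: the helper names `to_base` and `unpack` are hypothetical, and it assumes a radix of at most 36, which holds here since the radix is 13. Each token in the payload is an index written in base 13, and it is swapped for the keyword at that index.

```python
import re

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def to_base(number, radix):
    """Mimic JavaScript's Number.toString(radix) for non-negative integers."""
    if number == 0:
        return "0"
    out = ""
    while number:
        number, remainder = divmod(number, radix)
        out = DIGITS[remainder] + out
    return out

def unpack(payload, radix, count, keywords):
    """Replace each base-`radix` token in payload with its keyword."""
    lookup = {}
    for index in range(count):
        token = to_base(index, radix)
        # An empty keyword means the token stands for itself (k[c] || e(c)).
        lookup[token] = keywords[index] or token
    return re.sub(r"\b\w+\b", lambda m: lookup.get(m.group(0), m.group(0)), payload)

# Payload and keyword list copied from the script in the question.
payload = "1 6(){2.3='4=5; 0-7=8; 9=/';a.b.c()}"
keywords = ("max|function|document|cookie|website|"
            "455b33285501836b3483c1554b8d8c51586bd800|challenge|age|1600|"
            "path|window|location|reload").split("|")

print(unpack(payload, 13, 13, keywords))
```

Run on the payload from the question, this prints the same challenge() function on a single line.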

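The challenge() function only sets a cookie and reloads the page, so you need to send that cookie with your HTTP request to get the actual page. Something like this: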

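import re
import urllib2

opener = urllib2.build_opener()
# 1) Fetch the initial page, which contains only the packed challenge script.
page = opener.open("<the target URL>").read()
# 2) Pull the cookie value out of the packer's "|"-separated keyword list.
code = re.search(r'website\|([^|]+)\|', page).group(1)

# 3) Request the page again with the cookie set. Only name=value belongs in
#    a Cookie request header; max-age and path are attributes the browser keeps.
opener.addheaders.append(('Cookie', 'website=' + code))
r = opener.open("<the target URL>")
print r.read()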

This 1) fetches the initial page, 2) extracts the cookie value, and 3) requests the page again with the correct cookie.
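For Python 3, where urllib2 no longer exists, the same three steps might look like the sketch below. The function names `extract_cookie_value` and `fetch_real_page` are hypothetical, and the code assumes the page is UTF-8 encoded and that the site keeps using the same "website" keyword layout as in the question.

```python
import re
import urllib.request

# Matches the keyword that follows "website" in the packer's "|"-separated list.
COOKIE_RE = re.compile(r"website\|([^|]+)\|")

def extract_cookie_value(page):
    """Return the value packed after the 'website' keyword, or None."""
    match = COOKIE_RE.search(page)
    return match.group(1) if match else None

def fetch_real_page(url):
    """Fetch url; if the challenge page comes back, retry with the cookie."""
    # Step 1: fetch the initial page (assumed to be UTF-8).
    with urllib.request.urlopen(url) as response:
        page = response.read().decode("utf-8", errors="replace")
    # Step 2: extract the cookie value from the packed script.
    value = extract_cookie_value(page)
    if value is None:
        return page  # no challenge found, so this is already the real page
    # Step 3: request the page again, sending the cookie.
    request = urllib.request.Request(url, headers={"Cookie": "website=" + value})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_real_page("<the target URL>")` would return the final HTML as a string. Returning the first response unchanged when no challenge is found keeps the helper usable on pages that are not protected.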

Using urlopen to access a page with a JavaScript packer
