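I am working through the Python Challenge to learn Python. In these challenges, grabbing the page source can be very helpful. However, when I use the Python Requests package on my Windows machine, I do not receive the page source I expect.
Code: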
# Using requests
import requests
url = "http://www.pythonchallenge.com/pc/def/ocr.html"
r = requests.get(url)
print(r.text)
My response (formatted for readability):
<html>
<head>
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<META HTTP-EQUIV="Expires" CONTENT="-1">
</head>
<body>
<script>
*Copyright (c) 2010 John Resig, http://jquery.com/
*Permission is hereby granted, free of charge, to any person obtaininga copy
*of this software and associated documentation files //(the"Software"), to deal
*in the Software without restriction, including without limitation the rights to
* use, copy, modify,\tmerge, //publish,distribute, sublicense, and/or sell copies
*of the Software, and to permit persons to whom the Software is furnished to do so,
*subject //to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
var keyString = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=";
function uTF8Encode(string)
{
string = string.replace(/x0dx0a/g, "x0a");
var output = "";
for (var n = 0; n < string.length; n++) {
var c = string.charCodeAt(n);
if (c < 128) {
output += String.fromCharCode(c);
else if ((c > 127) && (c < 2048)) {
output += String.fromCharCode((c >> 6) | 192);
output += String.fromCharCode((c & 63) | 128);
else {
output += String.fromCharCode((c >> 12) | 224);
output += String.fromCharCode(((c >> 6) & 63) | 128);
output += String.fromCharCode((c & 63) | 128);
}
}
return output;
}
function base64Encode(input)
{
var output = "";
var chr1, chr2, chr3, enc1, enc2, enc3, enc4;
var i = 0;
input = uTF8Encode(input);
while (i < input.length) {
chr1 = input.charCodeAt(i++);
chr2 = input.charCodeAt(i++);
chr3 = input.charCodeAt(i++);
enc1 = chr1 >> 2;
enc2 = ((chr1 & 3) << 4) | (chr2 >> 4);
enc3 = ((chr2 & 15) << 2) | (chr3 >> 6);
enc4 = chr3 & 63;
if (isNaN(chr2)) {
enc3 = enc4 = 64;
}
else if (isNaN(chr3)) {
enc4 = 64;
}
output = output + keyString.charAt(enc1) + keyString.charAt(enc2) + keyString.charAt(enc3) + keyString.charAt(enc4);
}
return output;
}
window.top.location.href = 'https://205.159.94.140/connect/Access?AgentCode=000&url=' + base64Encode(window.top.location.href) + '&cti=';
</script>
</body>
</html>
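Relevant page: http://www.pythonchallenge.com/pc/def/ocr.html
The response I receive does not match the source I see when I view the page in my browser. In addition, when I run the same code on my other machine (OS X), I can fetch the page source without any problem. Why does this happen?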
Answer 0 (score: 1)
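You are getting a different response because the request from your Windows machine, which is on a different local network, is being routed through a proxy. Note the redirect at the bottom of the response: window.top.location.href = 'https://205.159.94.140/connect/Access?AgentCode=000...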
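One way to confirm this from code is to check whether the body you received contains that kind of JavaScript redirect. The following is only a rough sketch: the helper names `looks_like_proxy_redirect` and `redirect_target_host` are hypothetical, and the pattern is a heuristic based solely on the response shown in the question, so it may not match other proxies.

```python
import re

# Heuristic marker: the top-level JavaScript redirect seen in the question's response.
REDIRECT_PATTERN = re.compile(r"window\.top\.location\.href\s*=")

# Captures the host that the redirect points at, if it is a quoted http(s) URL.
TARGET_PATTERN = re.compile(
    r"window\.top\.location\.href\s*=\s*['\"]https?://([^/'\"]+)"
)


def looks_like_proxy_redirect(html):
    """Return True if the body contains a top-level JavaScript redirect."""
    return bool(REDIRECT_PATTERN.search(html))


def redirect_target_host(html):
    """Return the host the redirect points at, or None if none is found."""
    match = TARGET_PATTERN.search(html)
    return match.group(1) if match else None
```

With the code from the question, you could call `looks_like_proxy_redirect(r.text)` after `requests.get(url)` and print `redirect_target_host(r.text)` to see which host is intercepting the request.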
You can pass the proxy's network configuration to the requests.get() call to prevent the redirect. From the documentation:
import requests
proxies = {
"http": "http://user:pass@205.159.94.140",
"https": "http://user:pass@205.159.94.140"
}
requests.get("http://example.org", proxies=proxies)
As per your comment, the actual proxy IP address and authentication details should be found in your LAN settings.
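If you are not sure which proxy is configured, it may help to look at what Python itself detects before hard-coding anything. The sketch below uses the standard library's `urllib.request.getproxies()`, which reads the `*_proxy` environment variables and, on Windows and macOS, falls back to the system proxy settings. The helper name `system_proxies` is hypothetical, and the result will not include credentials unless they are already part of the configured URL.

```python
import urllib.request


def system_proxies():
    """Return the http/https proxies that Python detects on this machine."""
    detected = urllib.request.getproxies()
    # Keep only the schemes that matter for requests.get() in this question.
    return {
        scheme: proxy_url
        for scheme, proxy_url in detected.items()
        if scheme in ("http", "https")
    }


if __name__ == "__main__":
    # An empty dict means no proxy was detected from the environment or OS.
    print(system_proxies())
```

The resulting dictionary has the same shape as the `proxies` argument shown above, so, assuming the detected values are correct for your network, it could be passed as `requests.get(url, proxies=system_proxies())`.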