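I'm trying to extract some information from a website through a proxy server using Python. This is the code I have so far, but it doesn't seem to work: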
import requests
import BeautifulSoup
URL = 'http://proxy.library.upenn.edu/login?url=http://clients1.ibisworld.com/'
session = requests.session()
# This is the form data that the page sends when logging in
login_data = {
'pennkey': "****",
'password': "****",
'submit': 'login',
}
# Authenticate
r = session.post(URL, data=login_data)
doc = BeautifulSoup.BeautifulSoup(r.content)
print doc
Edit: this is what gets printed:
Gorkems-MacBook-Pro:desktop gorkemyurtseven$ python extract.py
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta name="HandheldFriendly" content="True" />
<meta name="viewport" content="width=device-width, height=device-height, user-scalable=yes, minimum-scale=.5" />
<title>University of Pennsylvania Libraries Proxy Service - Login</title>
<link href="/public/proxysm.css" media="print, screen" rel="stylesheet" type="text/css" />
<script language="javascript">
function validate(){
var isgoldcard = document.authenticate.pass.value;
var isgoldcardRegxp = /00000/;
if (isgoldcardRegxp.test(isgoldcard) == true)
alert("Authentication is by PennKey only.");
}
</script>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-982196-4']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<!--[if IE]>
<style>
table, form .limitwidth {width: 252px;}
.holdsubmit {width: 143px;}
</style>
<![endif]-->
</head>
<body onload="document.authenticate.user.focus();">
<div id="logostripe">
<div><a href="http://www.library.upenn.edu/"><img src="/public/librarieslogologin.gif" border="0" alt="Penn Libraries Home" /></a></div>
</div>
<h1>Libraries Proxy Service</h1>
<div id="holder">
<form name="authenticate" action="https://proxy.library.upenn.edu/login" method="post" autocomplete="off">
<div class="limitwidth">
<input type="hidden" name="url" value="http://clients1.ibisworld.com/" />
<script type="text/javascript">
var t = location.search;
t = t.substr(t.indexOf('proxySessionID')+15,t.indexOf('&')-16);
document.cookie="proxySessionID="+escape(t)+"; path=/; domain=.library.upenn.edu";
</script>
<table align="center" cellspacing="0" cellpadding="2" border="0">
<tr>
<td class="holdlabels"><label for="user">PennKey:</label></td>
<td><input type="text" name="user" /></td>
</tr>
<tr>
<td><label for="password">Password:</label></td>
<td><input type="password" name="pass" onblur="validate(); return false;" /></td>
</tr>
<tr>
<td></td>
<td class="holdsubmit">
<div><input type="submit" value="Login" /></div>
</td>
</tr>
</table>
</div>
</form>
<ul class="moreinfo">
<li><a class="menuitem" href="http://www.upenn.edu/computing/pennkey">PennKey information</a></li>
</ul>
<div class="notices">
The Library Proxy Service allows you to use
domain-restricted resources & services by authenticating yourself as Penn Faculty,
Student, or Staff.
</div>
<div class="alert">
Please note limitations on the use of restricted online resources.
<br /><br />
PennKey holders must be current faculty, student, or staff, have valid University PennCommunity credentials and abide by stated <a href="http://www.library.upenn.edu/policies/appropriate-use-policy.html">Restrictions On Use</a>.
<br /><br />
In addition, users agree to the <a href="http://www.upenn.edu/computing/policy/aup.html">University's Appropriate Use Policy</a>.
</div>
</div><!-- close holder -->
</body>
</html>
Answer 0 (score: 0)
Here is a solution that works for me (also using Penn's proxy server):
import requests
from bs4 import BeautifulSoup
proxies = {'https': 'https://proxy.library.upenn.edu'}
auth = requests.auth.HTTPProxyAuth('[username]', '[password]')
r = requests.get('http://www.example.com/', proxies=proxies, auth=auth)
print(BeautifulSoup(r.content, 'html.parser'))
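The first key point is that the proxy server is https, not http (this took far too long to figure out). Next, you have to authenticate with the server using requests.auth.HTTPProxyAuth. Once those two variables are set, you should be able to navigate wherever you need.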
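If you need to make more than one request, the same two settings can be attached to a requests.Session so they are sent every time. The following is only a minimal sketch, not a tested recipe for Penn's server: the helper name make_proxy_session and the constant PROXY_URL are made up for illustration, and the proxy URL is simply the one quoted above. It also maps both the http and https keys, because requests picks the proxy by the scheme of the target URL. The request is only prepared here, not sent, so the sketch runs without network access.

```python
import requests
from requests.auth import HTTPProxyAuth

# Assumption: this is the proxy URL quoted in the answer above.
PROXY_URL = 'https://proxy.library.upenn.edu'


def make_proxy_session(username, password, proxy_url=PROXY_URL):
    """Hypothetical helper: a Session that carries proxy settings and credentials."""
    session = requests.Session()
    # requests selects a proxy by the scheme of the *target* URL,
    # so map both schemes to the same proxy.
    session.proxies = {'http': proxy_url, 'https': proxy_url}
    # HTTPProxyAuth adds a Proxy-Authorization (HTTP Basic) header to each request.
    session.auth = HTTPProxyAuth(username, password)
    return session


# Build the session and prepare (but do not send) a request to inspect it.
session = make_proxy_session('[username]', '[password]')
prepared = session.prepare_request(
    requests.Request('GET', 'http://www.example.com/')
)
print(sorted(session.proxies))
print('Proxy-Authorization' in prepared.headers)
```

To actually fetch a page, call session.get('http://www.example.com/', timeout=30) and pass r.content to BeautifulSoup as before. Whether the server accepts these credentials this way is something to check against your own account.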