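I'm trying to log in to GitHub with urllib.request on Python 3.4.2 and I get the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
My assumption is that I'm not mimicking a real user (I tried to do that by adding a user agent). I could use Selenium with PhantomJS, but that is a bit too heavyweight and slow.
Yes, I could use GitHub's API, but the point is to learn how to log in to websites with urllib; GitHub is just the site I wanted to try it on.
Here is the code: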
import urllib.request
from bs4 import BeautifulSoup
import urllib.parse
#Login Info
user_name = 'USERNAME'
password = 'PASSWORD'
#Request the login page to pull the auth token
soup = urllib.request.urlopen('https://github.com/login')
soup_content = soup.read()
pretty_soup = BeautifulSoup(soup_content)
#Request the auth token
for tags in pretty_soup.findAll("meta", {'name': 'csrf-token'}):
    auth_token = tags['content']
#Post login and auth token
url = 'https://github.com/session'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
values = {'utf8' : '%E2%9C%93',
          'authenticity_token' : auth_token,
          'login' : user_name,
          'password' : password }
headers = { 'User-Agent' : user_agent }
data = urllib.parse.urlencode(values)
binary_data = data.encode('UTF-8')
req = urllib.request.Request(url, binary_data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
pretty_page = BeautifulSoup(the_page)
print(pretty_page)
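For reference, two things in the snippet above may be worth checking. A plain urllib.request.urlopen() call does not keep cookies between requests, so whatever cookie the login page sets is not sent back with the POST. Also, urllib.parse.urlencode() percent-escapes values itself, so the already-escaped '%E2%9C%93' gets escaped a second time. The following is only a minimal sketch of how both could be handled with the standard library; the helper names make_cookie_opener and encode_login_form are hypothetical, and I have not confirmed that either point is the cause of the 403.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_cookie_opener(user_agent):
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Sent with every request made through this opener.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener, jar


def encode_login_form(auth_token, user_name, password):
    """URL-encode the form fields; urlencode does the percent-escaping."""
    values = {'utf8': '\u2713',  # the raw check mark, not '%E2%9C%93'
              'authenticity_token': auth_token,
              'login': user_name,
              'password': password}
    return urllib.parse.urlencode(values).encode('utf-8')


# Passing the pre-escaped string through urlencode escapes it again:
print(urllib.parse.urlencode({'utf8': '%E2%9C%93'}))  # utf8=%25E2%259C%2593
print(urllib.parse.urlencode({'utf8': '\u2713'}))     # utf8=%E2%9C%93

# Hypothetical usage (needs network access and real credentials):
# opener, jar = make_cookie_opener('Mozilla/5.0')
# login_html = opener.open('https://github.com/login').read()
# ... extract auth_token from login_html as in the code above ...
# response = opener.open('https://github.com/session',
#                        encode_login_form(auth_token, 'USERNAME', 'PASSWORD'))
```

With this layout the GET and the POST go through the same opener, so any cookies from the first response are attached to the second request automatically.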
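Edit 1:
Added how to do this with requests instead of urllib, as requested in the comments. The problem now is that the expected output is the account settings page on GitHub, but instead I get a 302 redirect and end up at a 403 error again.
I added HTTPBasicAuth because I thought I might be running into an authentication problem. After trying that, I thought that fetching the account settings page in the get() request might need a cookie, but that still doesn't work.
Here is the new code: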
import requests
from bs4 import BeautifulSoup
from requests.auth import HTTPBasicAuth
#Pull login url content
user_agent = {'User-agent':'Mozilla/5.0'}
url = 'https://github.com/login'
r = requests.get(url,headers=user_agent)
soup = BeautifulSoup(r.content)
#Request the auth token
for tags in soup.findAll("meta", {'name': 'csrf-token'}):
    auth_token = tags['content']
#Post data
user_name = 'USERNAME'
password = 'PASSWORD'
session_url = u'https://github.com/session' #Url to post payload to
payload = {'utf8' : '✓',
           'authenticity_token' : auth_token}
auth = HTTPBasicAuth(user_name, password)
with requests.Session() as s:
    #Post the payload and user agent to the session url
    post_it = s.post(url=session_url, data=payload, auth=auth)
    cookie = post_it.headers['Set-Cookie']
    cookies = dict(cookies_are=cookie)
    # An authorised request.
    g = s.get('https://github.com/settings/profile',cookies=cookies, allow_redirects=True)
new_soup = BeautifulSoup(g.content)
#print(new_soup.title.text)
####### EXPECTED OUTPUT WOULD BE: 'Your Profile', i.e. the title tag for https://github.com/settings/profile, NOT 'Sign In'
#Test issues
print('check login: \n',
      user_agent,'\n',
      url,'\n',
      auth_token,'\n',
      'post login: \n',
      session_url,'\n',
      payload,'\n',
      user_agent,'\n',
      post_it,'\n',
      g.history[0],'\n',
      new_soup.title.text)
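A note on the cookie handling in the code above: dict(cookies_are=cookie) builds a single cookie literally named cookies_are whose value is the whole raw header string, which is probably not what the server expects. The sketch below, using only the standard library, shows what one Set-Cookie header value actually contains once parsed. The header value is made up for illustration and cookies_from_header is a hypothetical helper, not part of requests.

```python
from http.cookies import SimpleCookie


def cookies_from_header(set_cookie_value):
    """Parse one Set-Cookie header value into a {name: value} dict."""
    parsed = SimpleCookie()
    parsed.load(set_cookie_value)
    # Attributes such as Path or HttpOnly live on the morsel, not in the dict.
    return {name: morsel.value for name, morsel in parsed.items()}


# Made-up header value, for illustration only.
header = 'session_id=abc123; Path=/; Secure; HttpOnly'

print(cookies_from_header(header))  # {'session_id': 'abc123'}
print(dict(cookies_are=header))     # one cookie named 'cookies_are'
```

In practice this parsing should not be necessary with requests: a requests.Session stores the cookies it receives in s.cookies and resends them on later requests made through the same session. One thing that may matter here is that the login page is fetched with requests.get() outside the session, so any cookie set alongside the token is not part of s when the POST is made.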