403 error when logging in to GitHub with Python using urllib.request

Time: 2015-03-12 13:44:43

Tags: python python-3.x web-scraping python-requests urllib

I get the following error when trying to log in to GitHub with urllib.request on Python 3.4.2:

  

urllib.error.HTTPError: HTTP Error 403: Forbidden

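My assumption is that I am not mimicking a real user closely enough (I tried to address this by adding a User-Agent header). I could use Selenium and PhantomJS, but that is rather heavyweight and slow.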
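One way to test an assumption like this is to catch the error and look at what the server actually sent back, since the body of a 403 response sometimes says why the request was refused. The following is a minimal sketch; `describe_http_error` and `fetch` are hypothetical helper names, not part of urllib.

```python
import urllib.error
import urllib.request


def describe_http_error(err, max_body=300):
    """Summarise an HTTPError: status code, reason and the start of the body."""
    body = err.read().decode("utf-8", errors="replace")
    return {"code": err.code, "reason": err.reason, "body": body[:max_body]}


def fetch(url, user_agent="Mozilla/5.0"):
    """Open url with a User-Agent header.

    Returns (True, response_bytes) on success, or (False, error_summary)
    when the server answers with an HTTP error status such as 403.
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req) as response:
            return True, response.read()
    except urllib.error.HTTPError as err:
        return False, describe_http_error(err)
```

If the summary looks the same with and without the browser-like User-Agent, the header is probably not what the server is objecting to.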

Yes, I could use GitHub's API, but the point is to learn how to log in to a website with urllib; GitHub is just the site I chose to try.

Here is the code:

    import urllib.request
    import urllib.parse
    from bs4 import BeautifulSoup

    # Login info
    user_name = 'USERNAME'
    password = 'PASSWORD'

    # Request the login page to pull the auth token
    soup = urllib.request.urlopen('https://github.com/login')
    soup_content = soup.read()
    pretty_soup = BeautifulSoup(soup_content)

    # Request the auth token
    for tags in pretty_soup.findAll("meta", {'name': 'csrf-token'}):
        auth_token = tags['content']

    # Post login and auth token
    url = 'https://github.com/session'
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36'
    values = {'utf8': '%E2%9C%93',
              'authenticity_token': auth_token,
              'login': user_name,
              'password': password}

    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    binary_data = data.encode('UTF-8')
    req = urllib.request.Request(url, binary_data, headers)
    response = urllib.request.urlopen(req)
    the_page = response.read()
    pretty_page = BeautifulSoup(the_page)

    print(pretty_page)
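One possible cause, offered as an assumption rather than a confirmed diagnosis: the code above fetches the login page and posts the form with two independent `urlopen` calls, so any cookies set by the login page are not sent back with the POST. A CSRF token is normally only accepted together with the session cookie it was issued with. Separately, `'%E2%9C%93'` is already percent-encoded, and `urlencode` encodes it a second time (to `%25E2%259C%2593`). The sketch below keeps cookies in an `http.cookiejar.CookieJar` and passes the raw `✓` character instead. The helper names are hypothetical, the form field names are taken from the code above, and GitHub's login form may have changed since.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_cookie_opener(user_agent="Mozilla/5.0"):
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def encode_login_form(auth_token, user_name, password):
    """Encode the login form fields as bytes suitable for a POST body."""
    values = {
        "utf8": "✓",  # raw character; urlencode percent-encodes it once
        "authenticity_token": auth_token,
        "login": user_name,
        "password": password,
    }
    return urllib.parse.urlencode(values).encode("utf-8")


def login(extract_token, user_name, password):
    """GET the login page, then POST the form through the same opener.

    extract_token is a caller-supplied function that takes the login page
    bytes and returns the token, e.g. the BeautifulSoup lookup used above.
    """
    opener, _jar = build_cookie_opener()
    with opener.open("https://github.com/login") as page:
        token = extract_token(page.read())
    form = encode_login_form(token, user_name, password)
    return opener.open("https://github.com/session", form)
```

Because both requests go through the same opener, the cookie received with the login page accompanies the token in the POST.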

Edit 1:

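As requested in the comments, I have added a version that uses requests instead of urllib. The problem now is that the expected result is to land on the account settings page on GitHub, but instead I get a 302 redirect that ends in a 403 error again.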

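I added HTTPBasicAuth because I thought I might be running into an authentication problem. After trying that, I thought the get() request for the account settings page might need a cookie, but it still does not work.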

Here is the new code:

    import requests
    from bs4 import BeautifulSoup
    from requests.auth import HTTPBasicAuth

    # Pull login url content
    user_agent = {'User-agent': 'Mozilla/5.0'}
    url = 'https://github.com/login'

    r = requests.get(url, headers=user_agent)

    soup = BeautifulSoup(r.content)

    # Request the auth token
    for tags in soup.findAll("meta", {'name': 'csrf-token'}):
        auth_token = tags['content']

    # Post data
    user_name = 'USERNAME'
    password = 'PASSWORD'

    session_url = u'https://github.com/session'  # Url to post payload to

    payload = {'utf8': '✓',
               'authenticity_token': auth_token}

    auth = HTTPBasicAuth(user_name, password)

    with requests.Session() as s:

        # Post the payload and user agent to the session url
        post_it = s.post(url=session_url, data=payload, auth=auth)
        cookie = post_it.headers['Set-Cookie']
        cookies = dict(cookies_are=cookie)

        # An authorised request.
        g = s.get('https://github.com/settings/profile', cookies=cookies, allow_redirects=True)
        new_soup = BeautifulSoup(g.content)

        #print(new_soup.title.text)

        # EXPECTED OUTPUT WOULD BE: 'Your Profile', e.g. the title tag for https://github.com/settings/profile, NOT 'Sign In'

    # Test issues
    print('check login: \n',
          user_agent, '\n',
          url, '\n',
          auth_token, '\n',
          'post login: \n',
          session_url, '\n',
          payload, '\n',
          user_agent, '\n',
          post_it, '\n',
          g.history[0], '\n',
          new_soup.title.text)
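A few things in the requests version may explain why it still ends at the sign-in page. The login page is fetched with a bare `requests.get`, outside the `Session`, so its cookies are not shared with the later POST. The payload has no `login` or `password` fields, and `HTTPBasicAuth` only adds an `Authorization` header, which a form-based login would not normally read. Finally, `dict(cookies_are=cookie)` sends one cookie literally named `cookies_are`, while a `Session` already keeps track of cookies on its own. The sketch below routes every request through one session. It is written under those assumptions, `login_and_get` is a hypothetical helper, and the form field names are the ones from the first snippet, which GitHub may have changed.

```python
LOGIN_URL = "https://github.com/login"
SESSION_URL = "https://github.com/session"


def login_and_get(session, extract_token, user_name, password,
                  target_url="https://github.com/settings/profile"):
    """Log in through `session` and return the response for target_url.

    session is expected to behave like requests.Session (get/post methods
    that share cookies). extract_token is a caller-supplied function that
    takes the login page bytes and returns the authenticity token.
    """
    # Fetch the login page with the session so its cookies are kept.
    login_page = session.get(LOGIN_URL)

    # Credentials go in the form body, next to the token.
    payload = {
        "utf8": "✓",
        "authenticity_token": extract_token(login_page.content),
        "login": user_name,
        "password": password,
    }
    session.post(SESSION_URL, data=payload)

    # The session resends whatever cookies the login set.
    return session.get(target_url)
```

It would be called inside `with requests.Session() as s:`, after `s.headers.update(user_agent)`, passing a function that pulls the token out with BeautifulSoup as in the code above. The title of the returned page then shows whether the login was accepted.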

0 answers:

No answers yet