Logging in to a website with Python requests

Asked: 2016-04-16 14:10:43

Tags: python login web-scraping python-requests

For a university project I am currently trying to log in to a website and scrape some details (a list of news articles) from my user profile.

I am new to Python, but I have done this before on other websites. My first two approaches produced different HTTP errors. I have already considered that the problem might lie in the headers my requests send, but my understanding of this site's login process seems to be insufficient.

This is the login page: http://seekingalpha.com/account/login

My first approach looks like this:

import requests

with requests.Session() as c:
    requestUrl = 'http://seekingalpha.com/account/orthodox_login'

    USERNAME = 'XXX'
    PASSWORD = 'XXX'

    userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'

    # Form fields as they appear in the browser's login request
    login_data = {
        "slugs[]": None,
        "rt": None,
        "user[url_source]": None,
        "user[location_source]": "orthodox_login",
        "user[email]": USERNAME,
        "user[password]": PASSWORD
        }

    # Post the credentials, then try to fetch a page that requires a login
    c.post(requestUrl, data=login_data, headers={"referer": "http://seekingalpha.com/account/login", "user-agent": userAgent})

    page = c.get("http://seekingalpha.com/account/email_preferences")
    print(page.content)

This results in a "403 Forbidden".
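One thing I am not sure about (I have not verified this for seekingalpha.com): maybe the login page itself sets cookies that need to be present when the form is posted, and maybe the None values in my dict never reach the server at all, since requests drops dictionary entries whose value is None when it encodes form data. A sketch of what I could try to rule both out:

import requests

USERNAME = 'XXX'
PASSWORD = 'XXX'
loginPageUrl = 'http://seekingalpha.com/account/login'
requestUrl = 'http://seekingalpha.com/account/orthodox_login'
userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'

with requests.Session() as c:
    c.headers.update({'user-agent': userAgent, 'referer': loginPageUrl})

    # Load the login page first so any cookies it sets end up in the session
    c.get(loginPageUrl)

    # Send the empty fields as "" instead of None -- requests silently drops
    # dict entries whose value is None, so they would never be form-encoded
    login_data = {
        "slugs[]": "",
        "rt": "",
        "user[url_source]": "",
        "user[location_source]": "orthodox_login",
        "user[email]": USERNAME,
        "user[password]": PASSWORD,
    }
    r = c.post(requestUrl, data=login_data)

    # If the login worked, the session should now hold an authentication cookie
    print(r.status_code)
    print(c.cookies.get_dict())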

My second approach looks like this:

from requests import Request, Session

requestUrl = 'http://seekingalpha.com/account/orthodox_login'

USERNAME = 'XXX'
PASSWORD = 'XXX'

userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'

# c.get(requestUrl) 
login_data = {
    "slugs[]": None,
    "rt": None,
    "user[url_source]": None,
    "user[location_source]": "orthodox_login",
    "user[email]": USERNAME,
    "user[password]": PASSWORD
    }
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4",
    "origin": "http://seekingalpha.com",
    "referer": "http://seekingalpha.com/account/login",
    "Cache-Control": "max-age=0",
    # header values should be strings; newer versions of requests reject ints
    "Upgrade-Insecure-Requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36"
    }

s = Session()
req = Request('POST', requestUrl, data=login_data, headers=headers)

prepped = s.prepare_request(req)
prepped.body ="slugs%5B%5D=&rt=&user%5Burl_source%5D=&user%5Blocation_source%5D=orthodox_login&user%5Bemail%5D=XXX%40XXX.com&user%5Bpassword%5D=XXX"

resp = s.send(prepped)

print(resp.status_code)

In this approach I tried to prepare the headers exactly the way my browser sends them. Sorry for the redundancy. This results in an HTTP 400 error.
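Could the 400 come from the fact that prepped.body gets overwritten after prepare_request, so the Content-Length header (computed from the original data dict) no longer matches the body that is actually sent? A way to compare the prepared request with the browser's request, letting requests do the form encoding itself:

from requests import Request, Session

# Reusing requestUrl, login_data and headers from above; with empty strings
# ("") instead of None in login_data, so those fields get form-encoded too
s = Session()
req = Request('POST', requestUrl, data=login_data, headers=headers)
prepped = s.prepare_request(req)

# Compare this with what the browser sends instead of pasting the body by
# hand; overwriting prepped.body afterwards would leave the Content-Length
# header out of sync with the actual body
print(prepped.headers)
print(prepped.body)

resp = s.send(prepped)
print(resp.status_code)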

Does anyone have an idea what is going wrong? It could well be a lot.

1 Answer:

Answer 0 (score: 2)

Rather than putting a lot of effort into logging in manually and working with a Session, I would suggest using your cookies and scraping the page right away.

When you log in, a cookie is usually added to your requests that identifies you. Have a look at this example:

(screenshot "My cookie": the cookies visible in the browser's developer tools)

Your code would then look something like this:

import requests

# Pass the cookies copied from your browser along with the request
response = requests.get("http://www.example.com", cookies={
                        "c_user": "my_cookie_part",
                        "xs": "my_other_cookie_part"
                        })
print(response.content)
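If you need more than one page, you can also attach the cookies to a Session once and reuse it for every request. The cookie name below is only a placeholder; take the real name and value from your browser's developer tools after logging in to seekingalpha.com manually:

import requests

s = requests.Session()
# Placeholder name/value -- copy the real session cookie from your browser
s.cookies.set("session_cookie_name", "session_cookie_value", domain="seekingalpha.com")

page = s.get("http://seekingalpha.com/account/email_preferences")
print(page.content)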