Logging in to a tricky website using Python

Time: 2019-02-22 15:03:25

Tags: python web-scraping python-requests

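I work as a data analyst in digital marketing. My department uses third parties to help bring in more customers. Each third party has a website that shows how many customers they have brought to our company. One of my jobs is to collect the numbers from each website and put them into a report, which is a long, manual process. So far I have managed to log in to some of the third-party websites and pull some data. However, there is one website I cannot log in to: https://inspire.flg360.co.uk/SignIn.php . I also need the session to be redirected to another URL so that I can scrape data from it.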

I wrote some code that successfully logs in to one of the websites I need information from.

import requests
from bs4 import BeautifulSoup
import re

username = 'username'
password = 'password'
scrape_url = 'https://portal.mvfglobal.com/index.php/dashboard'

login_url = 'https://portal.mvfglobal.com/index.php/login/login'
login_info = {'login_name': username, 'login_pass': password}

#Start session.
session = requests.session()

#Login using your authentication information.
session.post(url=login_url, data=login_info)

#Request page you want to scrape.
url = session.get(url=scrape_url)

soup = BeautifulSoup(url.content, 'html.parser')

print(soup)

However, when I try to log in to https://inspire.flg360.co.uk/SignIn.php using the same approach, I run into problems.

import requests
from bs4 import BeautifulSoup

username = 'username'
password = 'password'
login_url = 'https://inspire.flg360.co.uk/SignIn.php'
login_info = {'strEmail': username, 'strPassword': password}

scrape_url = 'https://inspire.flg360.co.uk/AuthUser.php'

#Start session.
session = requests.session()
#Login using your authentication information.
session.post(url=login_url, data=login_info)
#Request page you want to scrape.
url = session.get(url=scrape_url)

soup = BeautifulSoup(url.content, 'html.parser')

print(soup)

When I inspected the page, I noticed a 302 response redirecting to https://inspire.flg360.co.uk/AuthUser.php. However, when I try to log in with the code above, I still get an error.

I'm completely out of ideas. Any suggestions?
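One way to narrow down a failure like this is to look at the redirect chain that requests followed, since a 302 is involved. The sketch below is only an illustration: `redirect_chain` is a hypothetical helper, and the `fake` object is a stand-in with made-up values (not captured from the real site) so that the snippet runs offline. With a real `requests` response you would pass the object returned by `session.post(...)` instead, whose `history`, `status_code` and `url` attributes have the same shape.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return [(status_code, url), ...] for every hop, final response last."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in for a requests.Response; these values are hypothetical and only
# illustrate what a login that bounces back to the sign-in page might look like.
fake = SimpleNamespace(
    status_code=200,
    url="https://inspire.flg360.co.uk/SignIn.php",
    history=[
        SimpleNamespace(status_code=302,
                        url="https://inspire.flg360.co.uk/AuthUser.php"),
    ],
)

for status, url in redirect_chain(fake):
    print(status, url)
```

If the final URL is the sign-in page again, the server most likely rejected the submitted fields rather than the session itself.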

Final code below ________________________________________________________

import requests
from bs4 import BeautifulSoup
import hashlib

username = 'username'
password = 'password'
login_url = 'https://inspire.flg360.co.uk/AuthUser.php'
login_info = {"strForwardURL": "",
              "strEmail": username,
              "intRememberMe": 1,
              "strResponse": ""}

scrape_url = 'https://inspire.flg360.co.uk/ma/index.php'

# Start session.
session = requests.session()

# Get strResponse
strc = session.get(url=login_url)
strc = BeautifulSoup(strc.content, 'html.parser').findAll(attrs={"name": "strChallenge"})[0]['value']
strc_joined = strc + hashlib.md5(password.encode("utf-8")).hexdigest()
strresponse = hashlib.md5(strc_joined.encode("utf-8")).hexdigest()
login_info['strResponse'] = strresponse

#Login using your authentication information.
session.post(url=login_url, data=login_info)

# Request page you want to scrape.
url = session.get(url=scrape_url)

soup = BeautifulSoup(url.content, 'html.parser')

print(soup)

1 Answer:

Answer 0 (score: 3)

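It looks like the actual POST request sent from the page https://inspire.flg360.co.uk/SignIn.php contains some additional elements. Specifically, the POST data actually looks like this: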

strForwardURL=&strEmail=abc%40123.com&intRememberMe=1&strResponse=fdb4c46c5d0eeab6133be193afc7897e
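As a quick way to read a captured body like this, the standard library can decode it into name/value pairs. This is a minimal sketch that uses only the string shown above; the variable names `raw_body` and `fields` are made up for illustration.

```python
from urllib.parse import parse_qsl

# Form-encoded body copied from the captured request above.
raw_body = ("strForwardURL=&strEmail=abc%40123.com&intRememberMe=1"
            "&strResponse=fdb4c46c5d0eeab6133be193afc7897e")

# keep_blank_values=True preserves the empty strForwardURL field,
# which parse_qsl would otherwise drop.
fields = dict(parse_qsl(raw_body, keep_blank_values=True))

for name, value in fields.items():
    print(f"{name} = {value!r}")
```

Note that no plain-text password field appears in the decoded data.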

The fields are strForwardURL, strEmail, intRememberMe, and strResponse. Looking at the rest of the code on the page, clicking the Submit button triggers the following JavaScript:

    function fncSignIn() {

        var loginForm = document.getElementById("signinForm");

        if (loginForm.strEmail.value == "") {

            alert("Please enter your email address.");
            return false;

        }

        if (loginForm.strPassword.value == "") {

            alert("Please enter your password.");
            return false;

        }

        var submitForm = document.getElementById("submitForm");

        submitForm.strEmail.value = loginForm.strEmail.value;
        if (loginForm.intRememberMe.checked) submitForm.intRememberMe.value = 1;
        submitForm.strResponse.value = hex_md5(loginForm.strChallenge.value+hex_md5(loginForm.strPassword.value));

        submitForm.submit();

    }

Elsewhere on the page, you can find the strChallenge string here:

<input type="hidden" name="strChallenge" value="1d989603e448a1a0559f08bdc83a15522fbc6c0404ca66acc4cdd7aafe4039359e2fb23b706d60a3">

(By the way, this value changes on every reload.)

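Essentially, instead of the password as a plain string, the server wants the MD5 hex digest of the strChallenge string concatenated with the MD5 hex digest of the password.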

In Python, it would look like this:

import hashlib
password = "abcdefg12345"
strc = "1d989603e448a1a0559f08bdc83a15522fbc6c0404ca66acc4cdd7aafe4039359e2fb23b706d60a3"
strc_joined = strc + hashlib.md5(password.encode("utf-8")).hexdigest()
strresponse = hashlib.md5(strc_joined.encode("utf-8")).hexdigest()
print(strresponse)

In this example, the output is 0d289f39067a25430d4818fe38046372

Put that into the POST data from the original request:

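{"strForwardURL":"", "strEmail":"abc@123.com", "intRememberMe": 1, "strResponse": "0d289f39067a25430d4818fe38046372"}

With that, you should be able to log in. Whenever you want to scrape a page that needs this particular login, you should be able to grab strChallenge easily with BeautifulSoup4, compute the appropriate strResponse, and then log in.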
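The steps above (extract strChallenge, hash, build the POST data) can be sketched as one offline-testable helper. This is a rough sketch under stated assumptions, not the site's documented API: `ChallengeParser`, `md5_hex` and `build_login_payload` are hypothetical names, and the standard library's `html.parser` is swapped in for BeautifulSoup4 purely so the example runs without extra packages. The sample HTML reuses the hidden input shown earlier.

```python
import hashlib
from html.parser import HTMLParser


class ChallengeParser(HTMLParser):
    """Collects the value of the first <input name="strChallenge">."""

    def __init__(self):
        super().__init__()
        self.challenge = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "input" and attrs.get("name") == "strChallenge"
                and self.challenge is None):
            self.challenge = attrs.get("value")


def md5_hex(text):
    """MD5 hex digest of a UTF-8 string, mirroring the page's hex_md5()."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_login_payload(html, email, password):
    """Build the POST fields from the sign-in page HTML and credentials."""
    parser = ChallengeParser()
    parser.feed(html)
    if parser.challenge is None:
        raise ValueError("strChallenge not found in page")
    return {
        "strForwardURL": "",
        "strEmail": email,
        "intRememberMe": 1,
        # md5(challenge + md5(password)), as in fncSignIn().
        "strResponse": md5_hex(parser.challenge + md5_hex(password)),
    }


# Hidden input copied from the answer; in practice the HTML would come from
# a session.get() of the sign-in page made just before the login POST.
sample_html = ('<input type="hidden" name="strChallenge" value="1d989603e448a1'
               'a0559f08bdc83a15522fbc6c0404ca66acc4cdd7aafe4039359e2fb23b706d60a3">')

payload = build_login_payload(sample_html, "abc@123.com", "abcdefg12345")
print(payload)
```

Because the challenge changes on every reload, the page fetch and the login POST presumably need to share the same `requests` session, so that the server sees the challenge it just issued.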